Review of Carbone et al. 2004
Carbone, M., Gal, Y., Shieber, S., and Grosz, B. (2004). Unifying annotated discourse hierarchies to create a gold standard. In Proceedings of 4th SIGDIAL Workshop on Discourse and Dialogue.
This paper describes an attempt to create more authoritative discourse annotations by automatically combining the annotations produced by a small number of human annotators. It uses the Boston Directions Corpus, which has discourse annotations based on Grosz and Sidner’s theory. The authors note in the introduction that prior work has focused on LDS over HDS because HDS is difficult to evaluate: annotation is harder and takes more time, it is more subjective (i.e., inter-annotator agreement is lower), and it is unclear what metric to use for measuring agreement or similarity between two annotations.
They consider five methods of automatically combining annotations (consensus, i.e. full agreement, with and without hierarchical information; majority consensus with and without hierarchical information; and conflict-free union) and compare them to complete union and to taking the best single annotation as measured by inter-annotator agreement. They then evaluate the original annotations against each unified annotation using kappa, recall, precision, and non-crossing brackets. As is typically the case, the methods with high recall had low precision. The high-recall methods had high kappa, and the high-precision methods had high non-crossing-brackets scores. Conflict-free union and flat majority consensus did well on kappa and recall, but the authors suggest that hierarchical majority consensus is better suited to producing a high-precision, hierarchical gold standard.
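To make the flat (non-hierarchical) combination methods concrete, here is a minimal sketch. It is not the authors' implementation. It assumes each annotation is reduced to a set of segment-boundary positions, which discards the hierarchical information the paper also considers, and it leaves out conflict-free union. The function names, the toy data, and the convention of scoring an annotation against the unified result are all illustrative assumptions.

```python
from collections import Counter

def consensus(annotations):
    """Boundaries marked by every annotator (full agreement)."""
    return set.intersection(*annotations)

def majority(annotations):
    """Boundaries marked by more than half of the annotators."""
    counts = Counter(b for ann in annotations for b in ann)
    return {b for b, n in counts.items() if n > len(annotations) / 2}

def union(annotations):
    """Boundaries marked by at least one annotator (complete union)."""
    return set.union(*annotations)

def precision_recall(annotation, gold):
    """Score one annotation against a unified 'gold' boundary set.

    Assumed convention: precision is the fraction of the annotation's
    boundaries that appear in gold; recall is the fraction of gold
    boundaries that the annotation contains.
    """
    hits = len(annotation & gold)
    precision = hits / len(annotation) if annotation else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy data: three annotators, boundaries as utterance indices.
annotators = [{3, 7, 12}, {3, 7, 9}, {3, 12, 15}]

for name, combine in [("consensus", consensus),
                      ("majority", majority),
                      ("union", union)]:
    gold = combine(annotators)
    scores = [precision_recall(a, gold) for a in annotators]
    print(name, sorted(gold), scores)
```

Under this convention, a stricter combination yields a smaller unified set and a looser one yields a larger set, so one of the two scores rises as the other falls. That is the precision/recall trade-off noted above.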
They compare the combined annotation with the contributing annotations, rather than with other annotations created for the purpose, “for the sake of scientific validity”. However, since only three annotations are used to create the gold standard, the similarity scores reported here are probably artificially high, much like testing on training data. This method gives no indication of how well these comparisons carry over to other annotations. It would have been better, for example, to compare these against similarly prepared annotations unified from the non-specialist annotations.
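The "testing on training data" concern can be shown with a small sketch. It uses hypothetical boundary sets and complete union as the unified annotation, chosen only because it makes the effect easy to see. It does not reproduce the paper's data or its evaluation. Every contributing annotation is a subset of the union, so its precision against the union is 1.0 by construction, while an annotation that did not contribute has no such guarantee.

```python
def union_gold(annotations):
    """Unified annotation: every boundary marked by any contributor."""
    return set().union(*annotations)

def precision(annotation, gold):
    """Fraction of the annotation's boundaries found in the gold set."""
    return len(annotation & gold) / len(annotation) if annotation else 0.0

# Hypothetical boundary sets (utterance indices); not from the paper.
contributors = [{3, 7, 12}, {3, 7, 9}, {3, 12, 15}]
held_out = {3, 8, 12}  # an annotator who did not contribute to the gold set

gold = union_gold(contributors)

in_sample = [precision(a, gold) for a in contributors]
out_of_sample = precision(held_out, gold)

print("contributors:", in_sample)   # each is 1.0 by construction
print("held-out:", out_of_sample)   # 2 of its 3 boundaries are in gold
```

The same reasoning applies in weaker form to the other combination methods: the unified annotation is built from the annotations it is then scored against, so some agreement is guaranteed in advance.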