Error detection and correction in annotated corpora

Markus Dickinson

PhD Thesis.

Building on work showing the harmfulness of annotation errors for both the training and evaluation of natural language processing technologies, this thesis develops a method for detecting and correcting errors in corpora with linguistic annotation. The so-called variation $n$-gram method relies on the recurrence of identical strings with varying annotation to find erroneous mark-up.

We show that the method is applicable for varying complexities of annotation. The method is most readily applied to positional annotation, such as part-of-speech annotation, but can be extended to structural annotation, both for tree structures---as with syntactic annotation---and for graph structures---as with syntactic annotation allowing discontinuous constituents, or crossing branches.
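For positional annotation such as part-of-speech tags, the core idea can be sketched in a few lines: recurring word n-grams whose central word (the variation nucleus) receives different tags are flagged as error candidates. The sketch below is a simplified illustration, not the thesis's implementation; the toy corpus and tag names are invented for demonstration.

```python
from collections import defaultdict

def variation_ngrams(tagged_sents, n=3):
    """Find word n-grams whose middle word (the "variation nucleus")
    receives different tags across otherwise identical contexts.
    A simplified sketch of the variation n-gram idea: identical
    recurring strings with varying annotation are error candidates."""
    seen = defaultdict(set)  # word n-gram -> tags observed for its nucleus
    mid = n // 2
    for sent in tagged_sents:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for i in range(len(sent) - n + 1):
            gram = tuple(words[i:i + n])
            seen[gram].add(tags[i + mid])
    # only n-grams whose nucleus varies in tag are returned for inspection
    return {g: ts for g, ts in seen.items() if len(ts) > 1}

# Toy corpus: "to" is tagged inconsistently inside an identical trigram
corpus = [
    [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("go", "VB")],
    [("We", "PRP"), ("want", "VBP"), ("to", "IN"), ("go", "VB")],
]
print(variation_ngrams(corpus))
# flags ("want", "to", "go") with the conflicting tag set {"TO", "IN"}
```

In practice longer n-grams give stronger evidence that the variation is an error rather than genuine ambiguity, since a larger identical context makes a legitimate tag difference less likely.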

Furthermore, we demonstrate that the notion of variation for detecting errors is a powerful one, by searching for grammar rules in a treebank that have the same daughters but different mothers. We also show that such errors impact the effectiveness of a grammar induction algorithm and subsequent parsing.
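The same-daughters, different-mothers check can likewise be sketched as a grouping operation over the rules read off a treebank. The rule list below is hypothetical, invented only to illustrate the shape of the check.

```python
from collections import defaultdict

def varying_rules(rules):
    """Group grammar rules by their daughter sequence; a daughter
    sequence rewritten by more than one mother category is a
    variation candidate, i.e. a possible annotation error."""
    by_daughters = defaultdict(set)
    for mother, daughters in rules:
        by_daughters[tuple(daughters)].add(mother)
    return {d: ms for d, ms in by_daughters.items() if len(ms) > 1}

# Hypothetical treebank rules: CD NN appears under both NP and QP
rules = [
    ("NP", ["DT", "JJ", "NN"]),
    ("NP", ["CD", "NN"]),
    ("QP", ["CD", "NN"]),
]
print(varying_rules(rules))
# flags ("CD", "NN") with the conflicting mother set {"NP", "QP"}
```

As with the n-gram case, variation here signals only a candidate: some daughter sequences are genuinely ambiguous between categories, so flagged rules still require inspection.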

After detecting errors in the different corpora, we turn to correcting such errors through the use of more general classification techniques. Our results indicate that the particular classification algorithm is less important than understanding the nature of the errors and altering the classifiers to deal with them. With such alterations, we can automatically correct errors with 85% accuracy. By sorting the errors, we can assign over 20% of them to an automatically correctable class and speed up the re-annotation process by effectively categorizing the others.

