Detecting Inconsistencies in Treebanks

Markus Dickinson and Walt Detmar Meurers

Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003). Växjö, Sweden.

This paper discusses an automatic, data-driven approach to treebank error detection. The approach takes the variation n-grams defined in Dickinson and Meurers (2003) for detecting inconsistent part-of-speech annotations and adapts them to syntactic annotation. The underlying idea is to define a consistency test for the mapping from recurring strings to their syntactic annotation. A case study based on the WSJ treebank illustrates that the method successfully detects inconsistencies in syntactic category annotation. Since such inconsistencies are typically introduced by humans, our method works best for large corpora that have been annotated manually or semi-automatically, which is generally the case for current syntactic and other high-level annotation.
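To make the idea of a consistency test concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes a simplified corpus format in which each sentence is a pair of a token list and a dictionary from (start, end) spans to category labels. The function name find_variation, the NIL label for occurrences that are not annotated as a constituent, and the toy sentences are all illustrative assumptions. The sketch covers only the basic test on recurring strings: it ignores surrounding context and allows only one label per span.

```python
from collections import defaultdict

NIL = "NIL"  # assumed label for an occurrence that is not annotated as a constituent


def find_variation(corpus):
    """Return recurring strings that receive more than one annotation.

    corpus: list of (tokens, spans) pairs, where spans maps a
    (start, end) token range (end exclusive) to a category label.
    """
    # Step 1: collect every string annotated as a constituent somewhere.
    constituent_strings = set()
    for tokens, spans in corpus:
        for start, end in spans:
            constituent_strings.add(tuple(tokens[start:end]))
    lengths = {len(s) for s in constituent_strings}

    # Step 2: for every occurrence of such a string, record its label,
    # or NIL if that occurrence is not a constituent.
    labels = defaultdict(lambda: defaultdict(int))
    for tokens, spans in corpus:
        for n in lengths:
            for start in range(len(tokens) - n + 1):
                string = tuple(tokens[start:start + n])
                if string in constituent_strings:
                    label = spans.get((start, start + n), NIL)
                    labels[string][label] += 1

    # Step 3: keep only strings whose occurrences disagree.
    return {s: dict(c) for s, c in labels.items() if len(c) > 1}


# Invented toy data, for illustration only (not taken from the WSJ treebank).
toy_corpus = [
    (["sales", "next", "Tuesday"], {(0, 3): "S", (1, 3): "NP"}),
    (["arrives", "next", "Tuesday"], {(0, 3): "VP", (1, 3): "PP"}),
    (["the", "next", "Tuesday", "meeting"], {(0, 4): "NP"}),
]
variation = find_variation(toy_corpus)
```

In this toy corpus only the string "next Tuesday" is flagged, because its three occurrences are labeled NP, PP, and NIL. Whether such variation is an error or a genuine ambiguity still has to be decided, which is the part the sketch leaves out.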

Our work serves two main purposes for treebank improvement. First, it is a means of finding erroneous variation in a corpus, which can then be corrected. Second, it provides feedback for the development of empirically adequate standards for syntactic annotation by showing which distinctions are difficult to maintain over an entire corpus. Additionally, as a method for comparing syntactic annotation, our work could be useful for interannotator agreement testing and parser evaluation.

The code used for the paper is freely available from the DECCA software page.