Problems in Evaluating Grammatical Error Detection Systems

Martin Chodorow, Markus Dickinson, Ross Israel, and Joel Tetreault

Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012).

Many evaluation issues for grammatical error detection have previously been overlooked, making it hard to draw meaningful comparisons between different approaches, even when they are evaluated on the same corpus. To begin with, the three-way contingency between a writer's sentence, the annotator's correction, and the system's output makes evaluation more complex than in some other NLP tasks, which we address by presenting an intuitive evaluation scheme. Of particular importance to error detection is the skew of the data -- the low frequency of errors as compared to non-errors -- which distorts some traditional measures of performance and limits their usefulness, leading us to recommend the reporting of raw measurements (true positives, false negatives, false positives, true negatives). Other issues that are particularly vexing for error detection focus on defining these raw measurements: specifying the size or scope of an error, properly treating errors as graded rather than discrete phenomena, and counting non-errors. We discuss recommendations for best practices with regard to reporting the results of system evaluation for these cases, recommendations which depend upon making clear one's assumptions and applications for error detection. By highlighting these problems with current error detection evaluation, we aim to help the field move forward.
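The distortion that skew causes in traditional measures can be illustrated with a small sketch. The counts below are hypothetical, not taken from the paper; they simply assume the low error rate typical of learner corpora (here, 5% of tokens are errors):

```python
# Hypothetical raw measurements for a skewed error-detection test set:
# 50 true errors vs. 950 well-formed tokens.
tp, fn = 10, 40   # errors the system catches / misses
fp, tn = 20, 930  # well-formed tokens flagged / correctly left alone

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

# Accuracy looks strong (0.94) even though the system finds only
# 20% of the errors -- and a trivial system that flags nothing at
# all would score 0.95 accuracy on the same data. Reporting the raw
# counts (tp, fn, fp, tn) avoids this distortion.
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Because precision, recall, and F1 are all recoverable from the four raw counts (while the reverse is not true), publishing the counts lets later work recompute whichever summary measure its own application calls for.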

Bibtex entry:

@inproceedings{chodorow-etal-2012-problems,
  author    = {Chodorow, Martin and Dickinson, Markus and Israel, Ross and Tetreault, Joel},
  title     = {Problems in Evaluating Grammatical Error Detection Systems},
  booktitle = {Proceedings of COLING 2012},
  month     = {December},
  year      = {2012},
  address   = {Mumbai, India},
  pages     = {611--628},
  url       = {}
}