Towards Domain Adaptation for Parsing Web Data

Mohammad Khan, Markus Dickinson, and Sandra Kübler

Proceedings of the 9th Conference on Recent Advances in Natural Language Processing (RANLP 2013).

We improve upon a previous line of work on parsing web data by exploring the impact of different decisions regarding the training data. First, we compare training on automatically POS-tagged data vs. gold POS data. Second, we compare training and testing within sub-genres, i.e., we ask whether a close match of the genre is more important than training set size. Finally, we examine different ways to select out-of-domain parsed data to add to training, attempting to match the in-domain data in different shallow ways (sentence length, perplexity). In general, we find that approximating the in-domain data has a positive impact on parsing.
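
As a rough illustration of perplexity-based selection of out-of-domain data, the sketch below ranks out-of-domain sentences by their perplexity under a language model estimated from in-domain text and keeps the closest ones. It is only a minimal sketch under stated assumptions: the add-one-smoothed unigram model, the function names (train_unigram, perplexity, select_closest), and the cutoff k are all hypothetical and are not taken from the paper, whose language model and selection procedure are not described in this abstract.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one-smoothed unigram model from tokenized sentences.

    Assumption: a unigram model stands in for whatever language model
    is actually used; it is chosen here only to keep the sketch short.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def logprob(tok):
        return math.log((counts.get(tok, 0) + 1) / (total + vocab))

    return logprob


def perplexity(sentence, logprob):
    """Per-token perplexity of one tokenized sentence under the model."""
    if not sentence:
        return float("inf")
    avg = sum(logprob(tok) for tok in sentence) / len(sentence)
    return math.exp(-avg)


def select_closest(in_domain, out_of_domain, k):
    """Return the k out-of-domain sentences with the lowest perplexity
    under a model trained on the in-domain sentences (hypothetical helper)."""
    logprob = train_unigram(in_domain)
    ranked = sorted(out_of_domain, key=lambda s: perplexity(s, logprob))
    return ranked[:k]


if __name__ == "__main__":
    web = [["lol", "u", "r", "gr8"], ["u", "r", "so", "right"]]
    news = [
        ["the", "committee", "adjourned"],
        ["u", "r", "right"],
        ["lol", "so", "gr8"],
    ]
    # Keep the two out-of-domain sentences that look most like the web data.
    for sent in select_closest(web, news, 2):
        print(" ".join(sent))
```

Selection by sentence length could be sketched in the same way by swapping the sort key, for example ranking sentences by how far their length is from the average in-domain sentence length.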


BibTeX entry:

@InProceedings{khan:ea:13b,
  author    = {Mohammad Khan and Markus Dickinson and Sandra K\"ubler},
  title     = {Towards Domain Adaptation for Parsing Web Data},
  booktitle = {Proceedings of the 9th Conference on Recent Advances in 
               Natural Language Processing (RANLP 2013)},
  year      = {2013},
  address   = {Hissar, Bulgaria},
  pages     = {},
  url       = {http://cl.indiana.edu/~md7/papers/khan-et-al13b.html}
}