Markus's Budapest paper:

Background

There are these large bodies of text (called corpora, or in the singular a corpus [from Latin for "body"]) which are marked up with part of speech tags. These tags are basically just what you've seen in school: verb, noun, adjective. Sometimes they're more specific (like past tense verb or proper noun), and the tags aren't the same for every corpus, but that's the general idea. And since "past tense verb" takes too long to write out, people make up short labels for it like VBD (mnemonically, "VerB in its -eD [past tense] form").

And people use these texts to train part-of-speech taggers (among other things). These taggers can then tag new text which doesn't have any part of speech labels, like this file for instance. A tagger would go through and label this file like so:

And/CONJ people/NOUN use/VBPREZ these/PRO-MOD ...

(I just made up the tags on the spot: CONJ = conjunction, NOUN = noun, VBPREZ = present tense verb, PRO-MOD = a PROnoun type of thing that MODifies a noun.) So, this tagger could do this automatically, which is nice because we might want to add part of speech tags to the complete works of Mark Twain, and most of us don't have that kind of time. (That would be like taking eighth-grade grammar for the rest of your life.)
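If you're curious what this looks like in practice, here's a minimal sketch using NLTK, an off-the-shelf Python toolkit (nothing to do with our paper; just an illustration). It happens to use the same style of tag labels you'll see in the corpus example below:

    import nltk

    # One-time model downloads for the tokenizer and the tagger
    # (exact resource names depend on your NLTK version):
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    sentence = "And people use these texts to train part-of-speech taggers."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    print(tagged)
    # Something like: [('And', 'CC'), ('people', 'NNS'), ('use', 'VBP'), ('these', 'DT'), ...]
    # where CC = conjunction, NNS = plural noun, VBP = present tense verb, DT = determiner.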

And once we have this part of speech information, we can use it to help us write dictionaries or to help us figure out what the sentence means. And if we know what it means, we could maybe translate it into another language or use that meaning to figure out what flight you want booked or whatever. In short: to do more complex stuff, it's extremely useful to first figure out what part of speech a word is.

But if taggers are going to correctly mark up a text with part of speech labels, they need to know how to assign labels in the first place. That is, they need to see well-marked-up corpora first in order to learn, for example, that "the" is an article, or that in some contexts "shares" is a noun (Jimmy bought 40 shares of Microsoft.), but in others it's a verb (Jimmy shares his toys with the other children.).

Even more challenging are words like "back", which can be a noun (My back hurts.), an adjective (Use the back door please.), a verb (He tried to back out of the deal.), or an adverb-type thing (Every time I think I'm out, they pull me back in.).

But we don't need to talk about how taggers actually do this. The important thing is that they need good corpora to work with. If the corpus always has "back" as a noun, that information is often going to be wrong, and a tagger that has nothing else to go on is going to make a lot of mistakes. And then we're screwed.
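To make that "garbage in, garbage out" point concrete, here's a toy sketch (not a real tagger, and not what our paper does): the dumbest possible tagger just remembers the most common tag it saw for each word in its training corpus, so whatever errors the corpus has, the tagger inherits.

    from collections import Counter, defaultdict

    # Toy training data: (word, tag) pairs, the way they might come out of a corpus.
    # Note the deliberate corpus error: "back" in "back door" should be JJ (adjective),
    # but our imaginary annotator tagged it NN (noun) -- the corpus always has back as a noun.
    training = [("My", "PRP$"), ("back", "NN"), ("hurts", "VBZ"), (".", "."),
                ("Use", "VB"), ("the", "DT"), ("back", "NN"), ("door", "NN"), (".", ".")]

    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1

    def dumbest_tagger(word):
        # Pick whichever tag the word was seen with most often in training.
        if word in counts:
            return counts[word].most_common(1)[0][0]
        return "NN"   # blind guess for words we've never seen

    print([(w, dumbest_tagger(w)) for w in "He tried to back out".split()])
    # "back" is a verb here, but it comes out as NN -- that's all the corpus ever showed it.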

Just to give you an idea of what a corpus looks like, here's an example sentence from one corpus where each word has a part of speech tag. (I added the brackets to explain these tags.)

Pierre NNP [proper noun]
Vinken NNP
, ,
61 CD [cardinal number]
years NNS [plural common noun]
old JJ [adjective]
, ,
will MD [modal -- "helping verb"]
join VB [verb]
the DT [determiner -- i.e. article]
board NN [singular common noun]
as IN [preposition]
a DT
nonexecutive JJ
director NN
Nov. NNP
29 CD
. .

So, Pierre and Vinken are proper nouns, a comma is a comma, 61 is a number, and so on.
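In case you ever want to poke at one of these files yourself, the format really is just one word and its tag per line, so a few lines of Python are enough to read it ("corpus.txt" is a made-up file name, and I'm assuming a blank line separates sentences):

    # Read a file where each line is "word TAG" and a blank line ends a sentence.
    def read_tagged(path):
        sentence = []
        for line in open(path, encoding="utf-8"):
            line = line.strip()
            if not line:              # blank line: sentence is finished
                if sentence:
                    yield sentence
                sentence = []
            else:
                word, tag = line.split()[:2]
                sentence.append((word, tag))
        if sentence:                  # don't lose a sentence at the end of the file
            yield sentence

    for sent in read_tagged("corpus.txt"):
        print(sent[:3])   # e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]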

Okay, now these corpora are huge -- the corpus this particular example sentence comes from has something like 50,000 sentences and 1.2 million words. And when you have a text that large, there are bound to be problems -- namely, errors in the tags. Why? Because, to get the whole thing marked up, the corpus builders had to hire people to do it by hand over a long period of time. And, in case you didn't know, people are prone to errors.

For example, in this corpus "the" is tagged once as a noun and once as a verb. Obviously, somebody was sleeping on the job.

The Actual Paper

That's all background. Here's where our paper comes in. ("Our" because it was written by me and my advisor Detmar Meurers.)

So, what we do is try to find the errors in the corpus. With 1.2 million words, we can't just go through by hand and find the errors. Even if it takes only 1 second for us to check each word, that's 1.2 million seconds -- two solid weeks if you never sleep or eat, and more like 139 days if you only manage a couple of hours of actual checking a day. And these corpora are getting larger all the time. There's one that I know of that is 100 million words -- at that pace, that works out to about 31 years. I may or may not be done with my PhD by then, but I'd prefer to spend my time doing something else.
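(Here's the back-of-the-envelope math, if you want to check it. The 2.4 hours of actual checking per day is just my assumption to make the day counts concrete; checking around the clock would be about ten times faster.)

    # Hand-checking at 1 second per word, assuming ~2.4 focused hours of checking per day.
    CHECKING_SECONDS_PER_DAY = 2.4 * 3600

    for words in (1_200_000, 100_000_000):
        days = words / CHECKING_SECONDS_PER_DAY
        print(f"{words:,} words: about {days:,.0f} days, or {days / 365:.1f} years")
    # 1,200,000 words: about 139 days, or 0.4 years
    # 100,000,000 words: about 11,574 days, or 31.7 years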

So, we need some sort of method to automatically find errors. And that's what our paper does. Even if we can't say for sure what the right tag should be, being able to point to where something is wrong is a huge gain. And being able to do it in a day instead of 31 years is quite a nice improvement.

The paper is based on the following idea: if a string of words appears multiple times in the corpus but with different annotations, then the longer the string of words, the more likely one of those annotations is an error. I'll say this again because it's the main point: (long) repeated stretches of words have no reason to have different parts of speech. If they did, we might encounter a sentence like the following:

John is currently spending thirty/noun days in the hole .

And 3000 sentences later, we find (because it just happens to be the catchphrase of the day in my hypothetical world where Humble Pie songs are constantly alluded to -- real examples to follow):

John is currently spending thirty/adjective days in the hole .
Even if I can't decide if "thirty" is a noun or an adjective, I know there's no reason it should differ between these two sentences. You may not think it's a brilliant idea, but no one else has done it that we could find. Plus, it seems to work. So, that's nice.
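For the programming-inclined, here's a rough sketch of the idea in code -- a simplification, not the actual method from the paper: slide a window of some fixed length over the tagged corpus, remember every place each stretch of words shows up, and flag any stretch whose copies don't all carry the same tags.

    from collections import defaultdict

    def find_variation(tagged_words, n):
        """tagged_words is a list of (word, tag) pairs; n is the stretch length.
        Returns stretches of n words that occur more than once with differing tags."""
        occurrences = defaultdict(list)
        for i in range(len(tagged_words) - n + 1):
            window = tagged_words[i:i + n]
            words = tuple(w for w, t in window)
            tags = tuple(t for w, t in window)
            occurrences[words].append(tags)
        # Keep only stretches that repeat AND don't always carry the same tags.
        return {words: tag_seqs for words, tag_seqs in occurrences.items()
                if len(tag_seqs) > 1 and len(set(tag_seqs)) > 1}

    # Tiny made-up corpus: the same four words show up twice,
    # with "thirty" tagged NN (noun) one time and JJ (adjective) the other.
    corpus = [("spending", "VBG"), ("thirty", "NN"), ("days", "NNS"), ("inside", "IN"),
              ("and", "CC"),
              ("spending", "VBG"), ("thirty", "JJ"), ("days", "NNS"), ("inside", "IN")]

    for words, tag_seqs in find_variation(corpus, 4).items():
        print(" ".join(words), "->", tag_seqs)
    # spending thirty days inside -> [('VBG', 'NN', 'NNS', 'IN'), ('VBG', 'JJ', 'NNS', 'IN')]

(The paper has a smarter way of finding the long stretches than trying every window length separately -- that's the "cool way to calculate" I mention near the end -- but the flagging idea is the same.)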

So, for example, "shares" really could be a noun or a verb, but if it appears inside the same stretch of 134 words that shows up twice in the corpus -- tagged as a verb in one copy and as a noun in the other -- then it is likely that one of these is wrong. Or to take a real example, "sold" could be a past tense verb (VBD) or a past participle (VBN) (i.e. the difference between John sold/VBD his house. and John has sold/VBN his house.). But when it appears as the 84th word in a stretch of 103 words that appears 3 times in the corpus, it's definitely one or the other. Here's part of that stretch:

Results of the Monday , October 30 , 1989 , auction of short-term U.S. government bills , sold at a discount from face value in units of $ 10,000 to $ 1 million

Whatever the right tag is (past participle, VBN, in this case), there's no reason we would tag it one way one time and another way another time.

As you can also see, this corpus is quite fascinating! (It's old Wall Street Journal text from the late 1980s/early 1990s.)

And that's basically it. Well, there's a lot more to it, but that's the basic insight. Here's a shorter real example. The following stretch of 10 words appears 32 times in the corpus:

. In New York Stock Exchange composite trading yesterday ,

Out of these 32 times, the word "composite" appears 24 times as an adjective and 8 times as a noun. I don't remember for sure which one is correct (I think adjective is right), but it's clear that one of them is wrong. That is, there's no reason "composite" in "composite trading" should be a noun some of the time and an adjective the rest of the time.
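Counting those up is the easy part; continuing the rough sketch from above, it's something like this (the list below is fabricated to match the 24/8 split I just mentioned, not read from the corpus):

    from collections import Counter

    # Tags of "composite" collected from the 32 copies of the repeated stretch
    # ". In New York Stock Exchange composite trading yesterday ,"
    composite_tags = ["JJ"] * 24 + ["NN"] * 8

    print(Counter(composite_tags))   # Counter({'JJ': 24, 'NN': 8})
    # Two different tags for the same word in the same stretch of words:
    # at least one group has to be wrong (probably the 8 NN cases, if adjective is right).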

The longest such stretch is 224 words (it appears twice in the corpus), and it has 10 different errors in it -- that was kind of surprising, but I guess newspapers can be repetitive.

I think we also have a cool way to calculate these long stretches, but the paper is rather clear on that point. And by now you've graduated to linguistics-research-paper level.

So, now when someone asks, "What was Markus's paper about?", you can say, "I don't know, something about words." But I did what I could.