Linguistics 615
Corpus Linguistics
Spring 2009

Course goals

Advances in computer technology have revolutionized the ways linguists can approach their data. By using computers, we can access large bodies of text (corpora) and search for the phenomena in which we are interested. In this way, we can uncover complexities in naturally occurring data and explore issues related to frequency of usage.

In this course, we will investigate the following questions: What is a corpus? What corpora exist? How are corpora developed? What is XML? How does one search for specific phenomena in corpora? What is a concordancer? Do we need syntactic annotation? Are there programs that do the annotation automatically? Are there tools that help one search in linguistically annotated corpora?

Some details

Meeting time: TR 11:15am-12:30pm
Classroom: Ballantine Hall (BH) 317
Credits: 3
Course prerequisites: Graduate student status, or permission of instructor.

Instructor: Markus Dickinson
Office: Memorial Hall (MM) 317
Phone: 856-2535
E-mail: md7 ...AT... (remove our neighbor state)

Office hours: (at least for the first week)
M 11:00am
R 1:00pm
  or by appointment


There will be approximately one assignment every two weeks. These assignments give you the opportunity to explore, in practice, the topics discussed in class.


We will use one main required textbook, plus a second from which we will occasionally select readings. Additionally, some readings will be made available online.


Grades will be based on:

ASSIGNMENTS 50% (5@10% each)

Final projects

Final projects will allow you to explore a research topic of your own interest and to see how corpus linguistic methods can enhance that research. More details will be given sometime in February or early March.

Perl programming

Approximately every week, we will have a short lesson (20-30 minutes) on the Perl programming language. Perl is useful for writing quick programs that process text, change data formats, access web data, or serve as a front end for language technology. I assume no previous programming background.

Academic Misconduct:

Academic misconduct is not allowed in this course. The Indiana University Code of Student Rights, Responsibilities, and Conduct defines academic misconduct as "any activity that tends to undermine the academic integrity of the institution . . . Academic misconduct may involve human, hard-copy, or electronic resources . . . Academic misconduct includes, but is not limited to . . . cheating, fabrication, plagiarism, interference, violation of course rules, and facilitating academic misconduct" (II. G.1-6).

Students with Disabilities:

Students who need an accommodation based on the impact of a disability should contact me to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations.

I rely on Disability Services for Students for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted Disability Services are encouraged to do so (812-855-7578).


We will mix practical sessions with more lecture-based sessions.

| Month | Date | Topic | Readings | Perl | Assign. |
|---|---|---|---|---|---|
| Jan. | 13 | Intro to class | | | |
| | 15 | Why corpus linguistics? | A1, B2 | | |
| | 20 | Corpora at IU (handout; unix) | | 1 (code) | |
| | 22 | NO CLASS, I'M AWAY | | | |
| | 27 | NO CLASS, I'M AWAY | | | |
| | 29 | Basics | A2, SM17 | 2 (code) | |
| Feb. | 3 | Corpus annotation | A3, A4 | | |
| | W: 4 | Available corpora | A5, A7 | | |
| | 5 | Corpus annotation | | 3 (code) | #1 due |
| | 10 | Application #1: Language variation | A10.4, B4 | | |
| | W: 11 | In-class practice | C2 | | |
| | 12 | More on annotation: POS/Syntax | SM21 | | |
| | 17 | More on annotation: Reliability | SM29, Dickinson and Meurers (2003) | | |
| | 19 | Regular expressions | | 4 (code, palindromes.txt) | |
| | 24 | Regular expressions | | | #2 due |
| | 26 | Statistics | A6 | | |
| Mar. | 3 | Application #2: Collocations | B3, SM22 | 5 (code) | |
| | W: 4 | In-class practice | C1 | | |
| | 5 | Annotation tools | SM39 | | |
| | 10 | NO CLASS, I'M AWAY | | | |
| | 12 | NO CLASS, I'M AWAY | | | #3 due |
| | 24 | Application #3: Language learning | B6 | | |
| | W: 25 | In-class practice | C3 | | |
| | 26 | Automatic annotation: POS & syntax | | 6 (code) | |
| | 31 | NO CLASS, I'M AWAY | | | |
| Apr. | 2 | NO CLASS, I'M AWAY | | | |
| | 7 | Syntactic annotation | SM37, SM25 | | |
| | W: 8 | Syntactic searching (handout) | Meurers and Müller (2007); Meurers (2005) | | |
| | 9 | Linguist's Search Engine | Resnik and Elkiss (2005); Resnik et al. (2005) | 7 (code) | #4 due |
| | 14 | Web as Corpus 1 | SM42 | | |
| | 16 | Web as Corpus 2 | Baroni and Kilgarriff (2006); Sharoff (2006); Baroni and Bernardini (2004) | 8 (code) | |
| | 21 | Application #4: Translation | B5 | | |
| | W: 22 | In-class practice | C6 | | |
| | 23 | Semantic annotation | Burchardt et al. (2006); Palmer et al. (2000) | | |
| | 28 | Multidimensional analysis | | | |
| | 30 | Perl wrap-up | | 9 (code: a, b, c) | #5 due |
| May | 5 | Final papers/presentations (10:15am) | | | |



This syllabus is subject to change. In fact, it most likely will change as the course develops.


Baroni, Marco and Silvia Bernardini (2004).
BootCaT: Bootstrapping corpora and terms from the web.
In Proceedings of LREC 2004.

Baroni, Marco and Adam Kilgarriff (2006).
Large linguistically-processed Web corpora for multiple languages.
In Proceedings of EACL-06, Demonstration Session. Trento, Italy.

Burchardt, Aljoscha, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó and Manfred Pinkal (2006).
The SALSA corpus: a German corpus resource for lexical semantics.
In Proceedings of LREC-06. Genoa.

Dickinson, Markus and W. Detmar Meurers (2003).
Detecting Errors in Part-of-Speech Annotation.
In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03). Budapest, Hungary, pp. 107-114.

Meurers, Walt Detmar (2005).
On the use of electronic corpora for theoretical linguistics. Case studies from the syntax of German.
Lingua 115(11), 1619-1639.

Meurers, Walt Detmar and Stefan Müller (2007).
Corpora and Syntax (Article 44).
In Anke Lüdeling and Merja Kytö (eds.), Corpus linguistics, Berlin: Mouton de Gruyter.

Palmer, Martha, Hoa Trang Dang and Joseph Rosenzweig (2000).
Sense Tagging the Penn Treebank.
In Proceedings of the Second Language Resources and Evaluation Conference, LREC-00. Athens.

Resnik, Philip and Aaron Elkiss (2005).
The Linguist's Search Engine: An Overview.
In Proceedings of the ACL Interactive Poster and Demonstration Sessions. Ann Arbor, Michigan: Association for Computational Linguistics, pp. 33-36.

Resnik, Philip, Aaron Elkiss, Ellen Lau and Heather Taylor (2005).
The Web in Theoretical Linguistics Research: Two Case Studies Using the Linguist's Search Engine.
In 31st Meeting of the Berkeley Linguistics Society. pp. 265-276.

Sharoff, Serge (2006).
Creating general-purpose corpora using automated search engine queries.
In Marco Baroni and Silvia Bernardini (eds.), WaCky! Working papers on the Web as Corpus, Gedit, Bologna.