Linguistics 615
Corpus Linguistics
Spring 2013

Course goals: Advances in computer technology have revolutionized the ways linguists can approach their data. By using computers, we can access large bodies of text (corpora) and search for the phenomena in which we are interested. In this way, we can uncover complexities in naturally occurring data and explore issues related to frequency of usage.

In this course, we will investigate some of the following questions, among others: What is a corpus, and what corpora exist? How are corpora developed? How does one search for specific phenomena in corpora? What is a concordancer? Do we need syntactic annotation? Are there programs that do the annotation automatically? Are there tools that help search linguistically annotated corpora?

Meeting time: MW, 4:00-5:15pm

Classroom: Sycamore Hall (SY) 212
(We will likely move some sessions to the lab in Memorial Hall (MM) 401.)

Course website:

Assignments, slides, etc. will be posted here.

Credits: 3

Course prerequisites: Graduate student status, or permission of instructor.

Instructor: Markus Dickinson

Office: Memorial Hall (MM) 317

Phone: 856-2535

E-mail: (remove the nuisance bird)

Office hours: (at least for the first week)

R 11:00am-12:00pm
or by appointment

Assignments: There will be approximately one assignment every two weeks. These assignments give you the opportunity to practically explore the topics discussed in class.

Readings: There is a main required textbook we will use. Additionally, there will sometimes be readings available online.

Grading: Grades will be based on:

Assignments: 54% (6 @ 9% each)
Final project: 40% (due 5pm on Mon., Apr. 29)

Final projects: Final projects will allow you to explore a research topic of your own interest and how corpus linguistic methods can enhance the research. More details will be given soon, as part of your assignments.

Perl programming: Approximately every week, we will have a short lesson (20-30 minutes) on the Perl programming language. Perl is useful for writing quick programs that process text, change data formats, or access web data, and as a front end for language technology, among other things. I assume no previous programming background.

500 Billion Words Workshop: A workshop relevant to this course, What Can We Do With 500 Billion Words?, takes place April 18–20 here at IU. This is a great chance to hear various speakers talk about working with large data sets. Many talks are on Friday, April 19, so you should set aside part of that day to attend at least some of the workshop.

Academic Misconduct: Academic misconduct is not allowed in this course. The Indiana University Code of Student Rights, Responsibilities, and Conduct defines academic misconduct as “any activity that tends to undermine the academic integrity of the institution . . . Academic misconduct may involve human, hard-copy, or electronic resources . . . Academic misconduct includes, but is not limited to . . . cheating, fabrication, plagiarism, interference, violation of course rules, and facilitating academic misconduct” (II. G.1-6).

Students with Disabilities: Students who need an accommodation based on the impact of a disability should contact me to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations.

I rely on Disability Services for Students for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted Disability Services are encouraged to do so (812-855-7578).

Schedule: We will mix practical sessions with more lecture-based sessions.





Corpus basics

Jan. 7

Intro to class & corpora at IU (.pdf, unix: .pdf, 2x3.pdf)


Why corpus linguistics? (.pdf, 2x3.pdf)

A1, B2


Basics (.pdf, 2x3.pdf)

A2, Biber (1993)

1 (.pl)

including: corpus building principles (.pdf, 2x3.pdf)

Lozano and Mendikoetxea (to appear)


Basic text analysis (concordancing) (.pdf, 2x3.pdf)


No class, MLK Day


Corpus annotation (.pdf, 2x3.pdf)

A3, A4

2 (.pl)


Available corpora (.pdf, 2x3.pdf)

A5, A7

#1 due

Annotation tools (.pdf, 2x3.pdf)

3 (.pl)

Feb. 4

Application #1: Language variation (.pdf, 2x3.pdf)

A10.4, B4


In-class practice


Exploiting corpora


Statistical analysis

A6, Jenset (2008b)


Multidimensional analysis (.pdf, 2x3.pdf)

4 (.pl)


R: basics (.pdf, 2x3.pdf) (fake-bigrams.txt)

Jenset (2008a)

#2 due

R: corpus linguistics


Application #2: Collocations (.pdf, 2x3.pdf)

B3, Pedersen et al. (2011)


In-class practice (.pdf, 2x3.pdf, .pl)


Mar. 4

Regular expressions (.pdf, 2x3.pdf)

5 (.pl)

Regular expressions

#3 due (.pl)


No class, Spring Break


No class, Spring Break


Application #3: Language learning (.pdf, 2x3.pdf)



In-class practice (scripts.tgz)


More information: linguistic annotation & web data


Syntactic annotation (.pdf, 2x3.pdf)

Marcus et al. (1993)

6 (.pl)

Syntactic searching (.pdf, 2x3.pdf)

Meurers (2005); Meurers and Müller (2007)

#4 due

Apr. 1

TIGERSearch & Tregex (.pdf, 2x3.pdf)

Levy and Andrew (2006)

7 (.pl)

Automatic annotation: POS & syntax (.pdf, 2x3.pdf)


Web as Corpus 1 (.pdf, 2x3.pdf)

Sharoff (2006)

8

Web as Corpus 2

Baroni and Bernardini (2004); Baroni and Kilgarriff (2006)

#5 due


Application #4: Translation (.pdf, 2x3.pdf)



In-class practice



500 Billion Words Workshop


Perl wrap-up + presentations

9

#6 due

Project presentations


Final papers due @ 5pm

Disclaimer: This syllabus is subject to change. In fact, it will change as the course develops and as I figure out my travel schedule.


   Baroni, Marco and Silvia Bernardini (2004). BootCaT: Bootstrapping corpora and terms from the web. In Proceedings of LREC 2004.

   Baroni, Marco and Adam Kilgarriff (2006). Large linguistically-processed Web corpora for multiple languages. In Proceedings of EACL-06, Demonstration Session. Trento, Italy.

   Biber, Douglas (1993). Representativeness in Corpus Design. Literary and Linguistic Computing 8(4), 243–257.

   Jenset, Gard B. (2008a). Basic R for corpus linguistics. Methods in linguistics workshop, August 19, 2008.

   Jenset, Gard B. (2008b). Basic statistics for corpus linguistics. Methods in linguistics workshop, August 19, 2008.

   Levy, Roger and Galen Andrew (2006). Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy.

   Lozano, Cristóbal and Amaya Mendikoetxea (to appear). Learner corpora and Second Language Acquisition: The design and collection of CEDEL2. In N. Ballier, A. Díaz-Negrillo and P. Thompson (eds.), Automatic Treatment and Analysis of Learner Corpus Data, Amsterdam: John Benjamins.

   Marcus, Mitchell P., Beatrice Santorini and Mary Ann Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330.

   Meurers, Walt Detmar (2005). On the use of electronic corpora for theoretical linguistics. Case studies from the syntax of German. Lingua 115(11), 1619–1639.

   Meurers, Walt Detmar and Stefan Müller (2007). Corpora and Syntax (Article 44). In Anke Lüdeling and Merja Kytö (eds.), Corpus linguistics, Berlin: Mouton de Gruyter.

   Pedersen, Ted, Satanjeev Banerjee, Bridget McInnes, Saiyam Kohli, Mahesh Joshi and Ying Liu (2011). The Ngram Statistics Package (Text::NSP): A Flexible Tool for Identifying Ngrams, Collocations, and Word Associations. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World. Portland, OR, pp. 131–133.

   Sharoff, Serge (2006). Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini (eds.), WaCky! Working papers on the Web as Corpus, Gedit, Bologna.