L415 / L615
Corpus Linguistics
Fall 2015

Course goals: Advances in computer technology have revolutionized the ways linguists can approach their data. By using computers, we can access large bodies of text (corpora) and search for the phenomena in which we are interested. In this way, we can uncover complexities in naturally occurring data and explore issues related to frequency of usage.

In this course, we will investigate some of the following questions, among others: What is a corpus, and what corpora exist? How are corpora developed? How does one search for specific phenomena in corpora? What is a concordancer? Do we need syntactic annotation? Are there programs that do the annotation automatically? Are there tools that help search linguistically annotated corpora?

Meeting time: MW, 9:30–10:45am

Classroom: Lindley Hall (LH) 025
(We may move some sessions to the lab in Memorial Hall (MM) 401.)

Course website: http://cl.indiana.edu/~md7/15/615/

Assignments, slides, etc. will be posted here. (I use Oncourse only for emails & semi-restricted data.)

Credits: 3

Course prerequisites: None

Instructor: Markus Dickinson

Office: Memorial Hall (MM) 317

Phone: 856-2535

E-mail: md7@fantastic.mr.indiana.edu (remove the parts worth removing)

Office hours: (at least for the first week)

R 11:00am-12:00pm
or by appointment

Assignments: There will be approximately one assignment every two weeks. These assignments give you the opportunity to practically explore the topics discussed in class.

Readings: We will use one main required textbook. Additionally, there will sometimes be readings available online or through Oncourse.

There are some other recommended readings, depending upon your interests:

Grading: Grades will be based on:

Assignments: 49% (7 @ 7% each)
Final project: 40% (due 5pm on Wed., Dec. 16)

Graduate requirements: Graduate students will be required to do additional work on the assignments, as well as to have a more in-depth project. Specific details will be given as appropriate.

Final projects: Final projects will allow you to explore a research topic of your own interest and how corpus linguistic methods can enhance the research. More details will be given soon, as part of your assignments.

Perl programming: Approximately every week, we will have a short lesson (20-30 minutes) on the Perl programming language. Perl is useful for writing quick programs that process text, change data formats, serve as a front-end for language technology, etc. I assume no previous programming background.

Corpus Linguistics Fest (CLiF): Although this doesn’t really affect our course this semester, note that in summer 2016 (June 6–10), IU will host the Corpus Linguistics Fest (CLiF), which will consist of four days of courses & practical sessions, followed by a day of posters. Stefanie Dipper, Anke Lüdeling, and Amir Zeldes are slated as speakers, and, if you have a nice course project, it might make for a suitable poster. Tell your friends who made the silly mistake of not signing up for this class!

Academic Integrity (from the Dean for Academic Standards and Opportunities): As a student at IU, you are expected to adhere to the standards and policies detailed in the Code of Student Rights, Responsibilities, and Conduct (http://www.iu.edu/~code/). When you submit an assignment with your name on it, you are signifying that the work contained therein is all yours, unless otherwise cited or referenced. Any ideas or materials taken from another source for either written or oral use must be fully acknowledged. If you are unsure about the expectations for completing an assignment or taking a test or exam, be sure to seek clarification beforehand. All suspected violations of the Code will be handled according to University policies. Sanctions for academic misconduct may include a failing grade on the assignment, reduction in your final course grade, a failing grade in the course, among other possibilities, and must include a report to the Dean of Students who may impose additional disciplinary sanctions.

Students with Disabilities: Students who need an accommodation based on the impact of a disability should contact me to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations.

I rely on Disability Services for Students for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted Disability Services are encouraged to do so (812-855-7578; http://www.indiana.edu/~iubdss/).

(Tentative) Schedule: We will mix practical sessions with more lecture-based sessions, and the contents may change a bit as we discover what’s working for us and what’s not. (Also, I’m 90% sure my time estimates are way off.)





| Date | Topic | Reading | Perl / Assignments |
| Aug. 24 | Intro to class & corpora at IU (.pdf) | | |
| | Why corpus linguistics? (.pdf, 2x3.pdf) | ch. 1, Teubert (2005) | |
| | Basics (.pdf, 2x3.pdf) | Biber (1993) | |
| | including: corpus building principles (.pdf, 2x3.pdf) | Lozano and Mendikoetxea (2013) | |
| Sep. 2 | Corpus building: XML, etc. (.pdf, 2x3.pdf) | | 1 (.pl) |
| | Labor Day, no classes | | |
| | Basic text analysis (concordancing) (.pdf, 2x3.pdf) | ch. 9 | 2 (.pl); #1 due |
| | Corpus annotation (.pdf, 2x3.pdf) | ch. 2 | |
| | Available corpora (.pdf, 2x3.pdf) & tools (.pdf, 2x3.pdf) | | 3 (.pl) |
| | Regular expressions (.pdf, 2x3.pdf) (help.pl) | ch. 10 | |
| | Regular expressions | | #2 due |
| | Searching word forms (.pdf, 2x3.pdf) (handout) | ch. 11 | 4 (.pl, .txt) |
| | Collocations (.pdf, 2x3.pdf) | sec. 11.4, Pedersen et al. (2011) | |
| Oct. 5 | In-class practice (.pdf, 2x3.pdf) | | |
| | Statistical analysis (.pdf, 2x3.pdf) (handout from Stephanie Dickinson) | Gries (2010) | |
| | R: basics (.pdf, 2x3.pdf) | Jenset (2015) | #3 due |
| | R: corpus linguistics | | |
| | In-class practice | | 5 (.pl) |
| | Annotation: motivation (.pdf, 2x3.pdf) | ch. 7 | #4 due (Perl starter) |
| | Annotation: limitations | sec. 7.3 | |
| | Annotation: example uses (.pdf, 2x3.pdf) | ch. 8 | |
| Nov. 2 | Word-level annotation | ch. 3 | 6 (a, b, c, d) |
| | Morphosyntactic searching | ch. 11 (redux) | #5 due |
| | Syntactic annotation | ch. 4 | |
| | Syntactic searching (.pdf, 2x3.pdf) | ch. 12, Meurers (2005); Meurers and Müller (2007) | |
| | TIGERSearch & Tregex | Levy and Andrew (2006) | |
| | Automatic annotation: POS & syntax (Part 1: .pdf, 2x3.pdf) | Code: transform.pl, Ref. slides: nlp.pdf | 8; #6 due |
| | Thanksgiving break, no classes | | |
| | Thanksgiving break, no classes | | |
| | Semantic annotation | ch. 5 | |
| Dec. 2 | Web as Corpus | Sharoff (2006) | |
| | Perl wrap-up + presentations | | |
| | Project presentations | | #7 due |
| | Final papers due @ 5pm | | |

Disclaimer: This syllabus is subject to change. In fact, it will change as the course develops and as I figure out my travel schedule.


   Biber, Douglas (1993). Representativeness in Corpus Design. Literary and Linguistic Computing 8(4), 243–257. http://staff.um.edu.mt/albert.gatt/teaching/dl/biber93.pdf.

   Gries, Stefan Th. (2010). Useful statistics for corpus linguistics. In Aquilino Sánchez and Moisés Almela (eds.), A mosaic of corpus linguistics: selected approaches, Frankfurt am Main: Peter Lang, pp. 269–291. http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG_UsefulStats4CorpLing_MosaicCorpLing.pdf.

   Jenset, Gard B. (2015). Introducing R for Corpus Linguistics. May 26, 2015, http://www.academia.edu/12782310/Using_R_for_Corpus_Linguistics.

   Levy, Roger and Galen Andrew (2006). Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy. http://nlp.stanford.edu/pubs/levy_andrew_lrec2006.pdf.

   Lozano, Cristóbal and Amaya Mendikoetxea (2013). Learner corpora and Second Language Acquisition: The design and collection of CEDEL2. In N. Ballier, A. Díaz-Negrillo and P. Thompson (eds.), Automatic Treatment and Analysis of Learner Corpus Data, Amsterdam: John Benjamins. http://wdb.ugr.es/~cristoballozano/wp-content/uploads/Lozano-Mendikoetxea-2013-Learner-corpora-and-SLA-design-of-CEDEL2-version-web.pdf.

   Meurers, Walt Detmar (2005). On the use of electronic corpora for theoretical linguistics. Case studies from the syntax of German. Lingua 115(11), 1619–1639. http://ling.osu.edu/~dm/papers/meurers-03.html.

   Meurers, Walt Detmar and Stefan Müller (2007). Corpora and Syntax (Article 44). In Anke Lüdeling and Merja Kytö (eds.), Corpus linguistics, Berlin: Mouton de Gruyter. http://purl.org/net/dm/papers/meurers-mueller-07.html.

   Pedersen, Ted, Satanjeev Banerjee, Bridget McInnes, Saiyam Kohli, Mahesh Joshi and Ying Liu (2011). The Ngram Statistics Package (Text::NSP) : A Flexible Tool for Identifying Ngrams, Collocations, and Word Associations. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World. Portland, OR, pp. 131–133. http://www.aclweb.org/anthology-new/W/W11/W11-0821.pdf.

   Sharoff, Serge (2006). Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini (eds.), WaCky! Working papers on the Web as Corpus, Gedit, Bologna. http://wackybook.sslmit.unibo.it/pdfs/sharoff.pdf.

   Teubert, Wolfgang (2005). My version of corpus linguistics. International Journal of Corpus Linguistics 10(1), 1–13.