L715 (Ling) / B659 (CS)
Seminar: Detecting Latent User Properties in Text
Fall 2014

Course goals: This seminar will discuss methods for uncovering hidden user properties in data on the basis of language use. That is, how can we discover properties of the writer based (solely) on the way they use language? There are a number of areas within natural language processing where this has been explored, impacting tasks such as information retrieval, sentiment analysis, and automated essay scoring, and we will try to unpack both the commonalities and the differences in terms of features and (machine learning) techniques. Topics to be covered may include (but are not limited to): native language identification, gender identification, language proficiency classification, dialect identification, and authorship attribution & plagiarism detection. We will likely explore the effects of confounding variables, such as genre and text type (e.g., social media vs. essays), as well as interactions among the topics listed above. The exact topics will vary to some extent depending upon student interest.

Students will be expected to complete a project combining research insight and implementation, as well as to lead the class discussion for some of the readings.

Meeting time: MW, 4:00-5:15pm

Classroom: Ballantine Hall (BH) 321

Course website: http://cl.indiana.edu/~md7/14/715/

Course resources will mostly be posted to this website. I may put some things on Oncourse, and I’ll let you know if that’s the case.

Credits: 3

Note for CS students: For CS MS students, this course can be counted towards the Machine Learning specialization. For CS PhD students, this course can be counted towards the core requirements as one of the two “any 500+ course” options, or towards the AI minor. (Both are one-time exceptions, valid only for the fall 2014 co-listing.)

Course prerequisites: A course in computational linguistics or a related area is recommended before taking this course. You will be expected to be fairly comfortable with programming.

Instructor: Markus Dickinson

Office: Memorial Hall (MM) 317

Phone: 856-2535

E-mail: md7@indiana.edu

Office hours: R 11:00am–12:00pm, or by appointment

Readings: There will be weekly readings for discussion, most of which are available online. The exact schedule will depend upon a lot of factors, many of which are outlined next ...

Course requirements:

Academic Integrity (from the Dean for Academic Standards and Opportunities): As a student at IU, you are expected to adhere to the standards and policies detailed in the Code of Student Rights, Responsibilities, and Conduct (http://www.iu.edu/~code/). When you submit an assignment with your name on it, you are signifying that the work contained therein is all yours, unless otherwise cited or referenced. Any ideas or materials taken from another source for either written or oral use must be fully acknowledged. If you are unsure about the expectations for completing an assignment or taking a test or exam, be sure to seek clarification beforehand. All suspected violations of the Code will be handled according to University policies. Sanctions for academic misconduct may include a failing grade on the assignment, reduction in your final course grade, a failing grade in the course, among other possibilities, and must include a report to the Dean of Students who may impose additional disciplinary sanctions.

Students with Disabilities: Students who need an accommodation based on the impact of a disability should contact me to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations.

I rely on Disability Services for Students for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted Disability Services are encouraged to do so (812-855-7578; http://www.indiana.edu/~iubdss/).

(Tentative) Outline of Schedule: This will change a lot, based on what topics we find to be of most interest. Note that we’ll generally be moving from inherent properties of the writer (e.g., identity, gender, etc.) to changeable, in-flux properties (e.g., language proficiency). We may even uncover some new tasks in the process …

Approximate Dates         Approximate Topic                                                                           Non-Approx. Leader(s)
Aug. 25                   Intro (.pdf, 2x3.pdf)
Aug. 27, Sep. 3           General text classification & machine learning (day 1: .pdf, 2x3.pdf, day 2: .pdf, 2x3.pdf)
Sep. 8, 10                NLP tools (tagging, parsing, SRL) (day 1: .pdf, 2x3.pdf, day 2: .pdf, 2x3.pdf)
Sep. 15, 17, 22, 24       Authorship attribution
Sep. 29, Oct. 1, 6        Deception & plagiarism detection
Oct. 8                    Acquiring social media data
Oct. 8, 13, 15, 20, 22    Author profiling (e.g., sex) & attribution in social media
Oct. 27, 29, Nov. 3       Dialect identification
Nov. 5, 10, 12, 17        Native language identification
Nov. 19, Dec. 1, 3        Language proficiency identification
Dec. 8, 10                Project reports


Scattered among these topics will probably be some “play-dates”, where we just take a day to play with data & tools, to see if we can make some progress on these topics.


Topics: Your first assignment will be to browse the literature on a topic and find papers that you’re interested in reading. Details will be given in a separate handout.

Disclaimer: This syllabus is subject to change. All important changes will be made in writing, with ample time for adjustment.