L715 (Ling) / B659 (CS)
Seminar: Author Profiling
Fall 2016

Course goals: Author profiling, the task of detecting “hidden” demographic user properties (gender, age, personality, native language, etc.) based on language use, has attracted growing interest in recent years, especially for social media. This seminar will: a) explore the recent literature on the topic, and b) build systems to predict author properties. In addition to investigating and developing techniques for author profiling tasks, we will likely emphasize topics such as the acquisition of reliable data, connections to linguistically motivated features, and multilinguality.

Meeting time: TR, 2:30–3:45pm

Classroom: Cedar Hall (AC) C107

Course website: http://cl.indiana.edu/~md7/16/715/

Course resources will mostly be posted to this website. Some things I may put on Canvas, and I’ll let you know if that’s the case.

Credits: 3

Course prerequisites: A course in CL/NLP. Permission of instructor required.

Instructor: Markus Dickinson

Office: Ballantine Hall (BH) 851

Phone: 856-2535

E-mail: md7@indianamarco-polo.edu (remove explorer)

Office hours:

M 10:00–11:00am
R 11:00am–12:00pm
or by appointment

Readings: There will be weekly readings for discussion, most of which are available online. The exact schedule will depend upon a lot of factors, many of which are outlined next ...

Course requirements:

Academic Integrity: (from the Dean for Academic Standards and Opportunities)

“As a student at IU, you are expected to adhere to the standards and policies detailed in the Code of Student Rights, Responsibilities, and Conduct (http://studentcode.iu.edu). When you submit an assignment with your name on it, you are signifying that the work contained therein is yours, unless otherwise cited or referenced. Any ideas or materials taken from another source for either written or oral use must be fully acknowledged. All suspected violations of the Code will be reported to the Dean of Students and handled according to University policies. Sanctions for academic misconduct may include a failing grade on the assignment, reduction in your final course grade, and a failing grade in the course, among other possibilities. If you are unsure about the expectations for completing an assignment or taking a test or exam, be sure to seek clarification beforehand.”

Students with Disabilities: Students who need an accommodation based on the impact of a disability should contact me to arrange an appointment as soon as possible to discuss the course format, to anticipate needs, and to explore potential accommodations.

I rely on Disability Services for Students for assistance in verifying the need for accommodations and developing accommodation strategies. Students who have not previously contacted Disability Services are encouraged to do so (812-855-7578; http://www.indiana.edu/~iubdss/).

CAPS: One benefit of a school like IU is that there are many, many resources available to you. School—and life—can be intense at times, and if your academic responsibilities or other personal concerns are distracting or weighing on you this semester, I encourage you to contact Counseling and Psychological Services (CAPS, 812-855-5711, http://healthcenter.indiana.edu/counseling/). The people there can be a resource and a source of support, not just in times of crisis but also when you need an extra ear or a little extra support. I’m happy to be a listening ear, as well, but I have no counseling training and the folks at CAPS do. Note, too, that I am required to report certain things (e.g., reports of sexual assault, suicidal thoughts).

(Tentative) Outline of Schedule: This will change a lot, based on what topics we find to be of most interest. We may even uncover some new tasks in the process …

Approximate Dates | Approximate Topic | Non-Approx. Leader(s)
Aug. 23 | Intro (.pdf, 2x3.pdf) | Markus
Aug. 25, 30 | General text classification & machine learning (.pdf, 2x3.pdf, practical: .pdf, 2x3.pdf) | Markus
Sep. 1, 6, 8 | NLP tools (tagging, parsing, SRL) (.pdf, 2x3.pdf, practical: .pdf, 2x3.pdf) | Markus
Sep. 13, 15 | Acquiring social media data (.pdf, 2x3.pdf) | Markus
Sep. 20 | Background: Authorship attribution (Stamatatos 2009, Stamatatos et al. 2015) | Yasmeen
Sep. 22 | Authorship attribution: extensions (Afroz et al. 2012, Caliskan-Islam et al. 2015) | Xiaorui
Sep. 27, Oct. 4 (no class Sep. 29) | Gender & Age prediction (Schler et al. 2006, Sarawgi et al. 2011, Rangel et al. 2014) | Nishant
Oct. 6, 11 | Gender & Age prediction (Rangel et al. 2016, Bayot & Goncalves 2016, Bougiatiotis & Krithara 2016, Rangel & Rosso 2016) | Sarah
Oct. 11, 13 | Gender & Age prediction (Mukherjee & Liu 2010, Weren et al. 2014) | Yue
Oct. 13, 18 | Gender & Age prediction (Sap et al. 2014, Bamman et al. 2014, Rosenthal & McKeown 2011) | Wen
Oct. 20, 25 | Location/Dialect identification (Eisenstein et al. 2010, Bo et al. 2012, Zampieri et al. 2015) | Vanessa
Oct. 25, 27 | Deception detection (Mukherjee et al. 2013, Newman et al. 2003) | Nikita
Oct. 27, Nov. 1 | Personality classification (Oberlander & Nowson 2006, Schwartz et al. 2013, Bachrach et al. 2012) | Atreyee
Nov. 3, 8 | Personality classification (Golbeck et al. 2011, Noecker et al. 2013, Alm et al. 2005) | Misato
Nov. 8, 10 | Political affiliation identification (Pennacchiotti & Popescu 2011b, Maynard & Funk 2011, Zamal et al. 2012) | Noah
Nov. 10, 15 | Political affiliation identification (Pennacchiotti & Popescu 2011a, Conover et al. 2011a, Volkova et al. 2014) | Pranav
Nov. 17, 29 | Political affiliation identification (O'Connor et al. 2010b, Pla & Hurtado 2014) | Mike
Nov. 29, Dec. 1 | Native language identification (Bykh & Meurers 2012, Bykh & Meurers 2014, Malmasi & Cahill 2015) | Noor
Dec. 1, 6 | Native language identification (Malmasi & Dras 2014, Elfardy & Diab 2013, Gebre et al. 2013) | Inas
Dec. 6, 8 | Project reports | All

Scattered among these topics will probably be some “play-dates”, where we just take a day to play with data & tools, to see if we can make some progress on these topics.

Assignments

Disclaimer: This syllabus is subject to change. All important changes will be made in writing, with ample time for adjustment.

Some Possible Readings

This is not a comprehensive list, but rather a list of useful pointers. Follow up on references, authors, topics, etc. and see what turns up. It wouldn’t hurt to browse the PAN writeups (see below), the ACL Anthology (http://aclweb.org/anthology/), the ACM Digital Library (http://dl.acm.org), as well as using your favorite internet search engine and the good old library. See also: the Association for Computers and the Humanities: http://ach.org & the European Association for Digital Humanities: http://eadh.org.

Note that chapter 2 of Volkova (2015) provides a nice description of what many of these readings show; see especially Tables 2.1–2.3 on pp. 34–36. Also, note that the PAN challenges have lots of notebooks/writeups associated with specific competing systems; see, in particular: http://pan.webis.de/clef15/pan15-web/proceedings.html

Authorship attribution: Stamatatos et al. (2015); Juola (2008); Koppel et al. (2009); Stamatatos (2009); Koppel and Winter (2014); Dinu and Nisioi (2012); Sapkota et al. (2015); Layton et al. (2012); Luyckx and Daelemans (2008); Raghavan et al. (2010); Grieve (2007)

Author profiling: Argamon et al. (2009); Li et al. (2014); Pennebaker (2011); Preoţiuc-Pietro et al. (2016); Rangel et al. (2015, 2013); Rao et al. (2011, 2010); Volkova (2015); Bergsma et al. (2013); Bergsma and Van Durme (2013); Bergsma et al. (2012); Pennacchiotti and Popescu (2011b); Culotta et al. (2016); Singla and Richardson (2008)

Gender & age: Bamman et al. (2014); Burger et al. (2011); Koppel et al. (2002); Nguyen et al. (2014, 2013, 2011); Ruths and Pfeffer (2014); Schler et al. (2006); Van Durme (2012); Zhang and Zhang (2010); Sap et al. (2014); Filippova (2012); Kokkos and Tzouramanis (2014); Hovy (2015); Rangel and Rosso (2016); Rangel et al. (2013, 2015); Cheng et al. (2011, 2009); Fink et al. (2012); Poulston et al. (2015); Liu and Ruths (2013); Rosenthal and McKeown (2011); Ciot et al. (2013); Peersman et al. (2011); Alowibdi et al. (2013); Goswami et al. (2009); Sarawgi et al. (2011); Mukherjee and Liu (2010)

Political affiliation & ideology: Volkova et al. (2014); Zamal et al. (2012); Conover et al. (2011a,b); Pennacchiotti and Popescu (2011a); Cohen and Ruths (2013); Golbeck et al. (2010); Maynard and Funk (2011); Tumasjan et al. (2010); Gayo-Avello (2012); Lampos et al. (2013); Yano et al. (2013); Yano and Smith (2010); Golbeck and Hansen (2011); Wong et al. (2013); Diermeier et al. (2012)

Personality: Noecker Jr et al. (2013); Schwartz et al. (2013); Bachrach et al. (2012); Golbeck et al. (2011); Argamon et al. (2005); Nowson and Oberlander (2007); Oberlander and Nowson (2006); Mairesse et al. (2007)

NLI: Tetreault et al. (2013, 2012); Malmasi and Dras (2015, 2014); Wong and Dras (2011); Malmasi and Cahill (2015); Malmasi et al. (2015); Perkins (2015)

Other / Related work: Kosinski et al. (2013); Yang and Eisenstein (2013); Preoţiuc-Pietro et al. (2015a,b); Sloan et al. (2015); Fang et al. (2015); Garera and Yarowsky (2009a,b); Volkova et al. (2013); O’Connor et al. (2010); Eisenstein et al. (2010); Mislove et al. (2010); Yang et al. (2011)

Tools: Tausczik and Pennebaker (2010); Bifet et al. (2011)

References

   Jalal S Alowibdi, Ugo A Buy, and Philip Yu. 2013. Empirical evaluation of profile characteristics for gender classification on twitter. In Proceedings of the 12th International Conference on Machine Learning and Applications (ICMLA), volume 1, pages 365–369. Miami, FL.

   S. Argamon, M. Koppel, J. Pennebaker, and J. Schler. 2009. Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2):119–123. URL http://u.cs.biu.ac.il/~koppel/papers/AuthorshipProfiling-cacm-final.pdf.

   Shlomo Argamon, Sushant Dhawle, Moshe Koppel, and James W. Pennebaker. 2005. Lexical predictors of personality type. In Proceedings of the 2005 Joint Annual Meeting of the Interface and the Classification Society of North America.

   Yoram Bachrach, Michal Kosinski, Thore Graepel, Pushmeet Kohli, and David Stillwell. 2012. Personality and patterns of facebook usage. In Proceedings of the ACM Web Science Conference (WebSci). Evanston, IL.

   David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160. URL http://dx.doi.org/10.1111/josl.12080.

   Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, and David Yarowsky. 2013. Broadly improving user classification via communication-based name and location clustering on twitter. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1010–1019. Association for Computational Linguistics, Atlanta, Georgia. URL http://www.aclweb.org/anthology/N13-1121.

   Shane Bergsma, Matt Post, and David Yarowsky. 2012. Stylometric analysis of scientific articles. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 327–337. Association for Computational Linguistics, Montréal, Canada. URL http://www.aclweb.org/anthology/N12-1033.

   Shane Bergsma and Benjamin Van Durme. 2013. Using conceptual class attributes to characterize social media users. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 710–720. Association for Computational Linguistics, Sofia, Bulgaria. URL http://www.aclweb.org/anthology/P13-1070.

   Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. 2011. Moa: Massive online analysis. Journal of Machine Learning Research, 11:1601–1604.

   John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on twitter. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Edinburgh, Scotland, UK.

   Na Cheng, Rajarathnam Chandramouli, and K.P. Subbalakshmi. 2011. Author gender identification from text. Digital Investigation: The International Journal of Digital Forensics & Incident Response, 8(1):78–88. URL http://dx.doi.org/10.1016/j.diin.2011.04.002.

   Na Cheng, Xiaoling Chen, Rajarathnam Chandramouli, and K.P. Subbalakshmi. 2009. Gender identification from e-mails. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, pages 154–158. Nashville, TN.

   Morgane Ciot, Morgan Sonderegger, and Derek Ruths. 2013. Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1136–1145. Association for Computational Linguistics, Seattle, Washington, USA. URL http://www.aclweb.org/anthology/D13-1114.

   Raviv Cohen and Derek Ruths. 2013. Classifying political orientation on twitter: It’s not easy! In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), pages 91–99. Ann Arbor, MI.

   Michael D. Conover, Bruno Gonçalves, Jacob Ratkiewicz, Alessandro Flammini, and Filippo Menczer. 2011a. Predicting the political alignment of twitter users. In Proceedings of Social Computing.

   Michael D. Conover, Jacob Ratkiewicz, Matthew Francisco, Bruno Gonçalves, Filippo Menczer, and Alessandro Flammini. 2011b. Political polarization on twitter. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM), pages 89–96. Barcelona.

   Aron Culotta, Nirmal Kumar Ravi, and Jennifer Cutler. 2016. Predicting twitter user demographics using distant supervision from website traffic data. Journal of Artificial Intelligence Research, 55:389–408.

   Daniel Diermeier, Jean-François Godbout, Bei Yu, and Stefan Kaufmann. 2012. Language and ideology in congress. British Journal of Political Science, 42:31–55. URL http://journals.cambridge.org/article_S0007123411000160.

   Liviu P. Dinu and Sergiu Nisioi. 2012. Authorial studies using ranked lexical features. In Proceedings of COLING 2012: Demonstration Papers, pages 125–130. The COLING 2012 Organizing Committee, Mumbai, India. URL http://www.aclweb.org/anthology/C12-3016.

   Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1277–1287. Association for Computational Linguistics, Cambridge, MA. URL http://www.aclweb.org/anthology/D10-1124.

   Quan Fang, Jitao Sang, Changsheng Xu, and M. Shamim Hossain. 2015. Relational user attribute inference in social media. IEEE Transactions on Multimedia, 15(7):1–1.

   Katja Filippova. 2012. User demographics and language in an implicit social network. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1478–1488. Association for Computational Linguistics, Jeju Island, Korea. URL http://www.aclweb.org/anthology/D12-1135.

   Clay Fink, Jonathon Kopecky, and Maksym Morawski. 2012. Inferring gender from the content of tweets: A region specific example. In Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM), pages 459–462. Dublin.

   Nikesh Garera and David Yarowsky. 2009a. Modeling latent biographic attributes in conversational genres. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 710–718. Association for Computational Linguistics, Suntec, Singapore. URL http://www.aclweb.org/anthology/P/P09/P09-1080.

   Nikesh Garera and David Yarowsky. 2009b. Structural, transitive and latent models for biographic fact extraction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 300–308. Association for Computational Linguistics, Athens, Greece. URL http://www.aclweb.org/anthology/E09-1035.

   Daniel Gayo-Avello. 2012. No, you cannot predict elections with twitter. IEEE Internet Computing, 16(6):91–94.

   Jennifer Golbeck, Justin M. Grimes, and Anthony Rogers. 2010. Twitter use by the u.s. congress. Journal of the American Society for Information Science and Technology, 61(8):1612–1621.

   Jennifer Golbeck and Derek Hansen. 2011. Computing political preference among twitter followers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, pages 1105–1108. ACM, New York.

   Jennifer Golbeck, Cristina Robles, Michon Edmondson, and Karen Turner. 2011. Predicting personality from twitter. In Proceedings of the 2011 IEEE International Conference on Privacy, Security, Risk, and Trust, and IEEE International Conference on Social Computing, pages 149–156. Boston.

   Sumit Goswami, Sudeshna Sarkar, and Mayur Rustagi. 2009. Stylometric analysis of bloggers’ age and gender. In Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM), pages 214–217. San Jose.

   Jack Grieve. 2007. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3):251–270.

   Dirk Hovy. 2015. Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 752–762. Association for Computational Linguistics, Beijing, China. URL http://www.aclweb.org/anthology/P15-1073.

   John Noecker Jr, Michael Ryan, and Patrick Juola. 2013. Psychological profiling through textual analysis. Literary and Linguistic Computing, 28(3):382–387.

   Patrick Juola. 2008. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334.

   Athanasios Kokkos and Theodoros Tzouramanis. 2014. A robust gender inference model for online social networks and its application to linkedin and twitter. First Monday, 19(9).

   Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology (JASIST), 60(1):9–26.

   Moshe Koppel and Yaron Winter. 2014. Determining if two documents are written by the same author. Journal of the Association for Information Science and Technology (JASIST), 65(1):178–187. URL http://dx.doi.org/10.1002/asi.22954.

   Michal Kosinski, David Stillwell, and Thore Graepel. 2013. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15):5802–5805.

   Vasileios Lampos, Daniel Preoţiuc-Pietro, and Trevor Cohn. 2013. A user-centric model of voting intention from social media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 993–1003. Association for Computational Linguistics, Sofia, Bulgaria. URL http://www.aclweb.org/anthology/P13-1098.

   Robert Layton, Paul Watters, and Richard Dazeley. 2012. Recentred local profiles for authorship attribution. Natural Language Engineering, 18(3):293–312.

   Jiwei Li, Alan Ritter, and Eduard Hovy. 2014. Weakly supervised user profile extraction from twitter. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 165–174. Association for Computational Linguistics, Baltimore, Maryland. URL http://www.aclweb.org/anthology/P14-1016.

   Wendy Liu and Derek Ruths. 2013. What’s in a name? Using first names as features for gender inference in twitter. In Proceedings of the 2013 AAAI Spring Symposium Series, pages 10–16. Palo Alto, CA.

   Kim Luyckx and Walter Daelemans. 2008. Authorship attribution and verification with many authors and limited data. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 513–520. Manchester, UK. URL http://www.aclweb.org/anthology/C08-1065.

   François Mairesse, Marilyn A. Walker, Matthias R. Mehl, and Roger K. Moore. 2007. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research (JAIR), 30:457–500.

   Shervin Malmasi and Aoife Cahill. 2015. Measuring feature diversity in native language identification. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 49–55. Denver, CO. URL http://www.aclweb.org/anthology/W15-0606.

   Shervin Malmasi and Mark Dras. 2014. Language transfer hypotheses with linear svm weights. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1385–1390. Doha, Qatar. URL http://www.aclweb.org/anthology/D14-1144.

   Shervin Malmasi and Mark Dras. 2015. Large-scale native language identification with cross-corpus evaluation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1403–1409. Denver, CO. URL http://www.aclweb.org/anthology/N15-1160.

   Shervin Malmasi, Joel Tetreault, and Mark Dras. 2015. Oracle and human baselines for native language identification. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 172–178. Denver, CO. URL http://www.aclweb.org/anthology/W15-0620.

   Diana Maynard and Adam Funk. 2011. Automatic detection of political opinions in tweets. In Proceedings of the 8th International Conference on The Semantic Web (ESWC), pages 88–99. Heraklion, Crete, Greece.

   Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. 2010. You are who you know: Inferring user profiles in online social networks. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, pages 251–260. ACM, New York, NY, USA.

   M. Koppel, S. Argamon, and A. Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412. URL http://u.cs.biu.ac.il/~koppel/papers/male-female-llc-final.pdf.

   Arjun Mukherjee and Bing Liu. 2010. Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics, Cambridge, MA. URL http://www.aclweb.org/anthology/D10-1021.

   Dong Nguyen, Rilana Gravel, Dolf Trieschnigg, and Theo Meder. 2013. “How old do you think I am?”: A study of language and age in twitter. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media. AAAI Press, Palo Alto, CA.

   Dong Nguyen, Noah A. Smith, and Carolyn P. Rosé. 2011. Author age prediction from text using linear regression. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 115–123. Association for Computational Linguistics, Portland, OR, USA. URL http://www.aclweb.org/anthology/W11-1515.

   Dong Nguyen, Dolf Trieschnigg, A. Seza Doğruöz, Rilana Gravel, Mariet Theune, Theo Meder, and Franciska De Jong. 2014. Why gender and age prediction from tweets is hard: Lessons from a crowdsourcing experiment. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1950–1961. Dublin, Ireland.

   Scott Nowson and Jon Oberlander. 2007. Identifying more bloggers: Towards large scale personality classification of personal weblogs. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM). Boulder, CO.

   Jon Oberlander and Scott Nowson. 2006. Whose thumb is it anyway? classifying author personality from weblog text. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 627–634. Sydney, Australia. URL http://www.aclweb.org/anthology/P/P06/P06-2081.

   Brendan O’Connor, Jacob Eisenstein, Eric P. Xing, and Noah A. Smith. 2010. A mixture model of demographic lexical variation. In Proceedings of the NIPS Workshop on Machine Learning for Social Computing. Vancouver.

   Claudia Peersman, Walter Daelemans, and Leona Van Vaerenbergh. 2011. Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents (SMUC), pages 37–44. Glasgow.

   Marco Pennacchiotti and Ana-Maria Popescu. 2011a. Democrats, republicans and starbucks afficionados: User classification in twitter. In Proceedings of KDD 2011, pages 430–438.

   Marco Pennacchiotti and Ana-Maria Popescu. 2011b. A machine learning approach to twitter user classification. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM), pages 281–288. Barcelona.

   J. Pennebaker. 2011. The secret life of pronouns: What our words say about us. Bloomsbury Publishing, New York.

   Ria Perkins. 2015. Native language identification (nlid) for forensic authorship analysis of weblogs. In Maurice Dawson and Marwan Omar, editors, New threats and countermeasures in digital crime and cyber terrorism. IGI global.

   Adam Poulston, Mark Stevenson, and Kalina Bontcheva. 2015. Topic models and ngram language models for author profiling. In Notebook for PAN at CLEF 2015. Toulouse, France.

   Daniel Preoţiuc-Pietro, Vasileios Lampos, and Nikolaos Aletras. 2015a. An analysis of the user occupational class through twitter content. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1754–1764. Association for Computational Linguistics, Beijing, China. URL http://www.aclweb.org/anthology/P15-1169.

   Daniel Preoţiuc-Pietro, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach, and Nikolaos Aletras. 2015b. Studying user income through language, behaviour and affect in social media. PLoS ONE, 10(9). E0138717. doi:10.1371/journal.pone.0138717.

   Daniel Preoţiuc-Pietro, Wei Xu, and Lyle Ungar. 2016. Discovering user attribute stylistic differences via paraphrasing. In Proceedings of AAAI 2016.

   Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of the ACL 2010 Conference Short Papers, pages 38–42. Uppsala, Sweden. URL http://www.aclweb.org/anthology/P10-2008.

   Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd author profiling task at PAN 2015. In Linda Cappellato, Nicola Ferro, Gareth Jones, and Eric San Juan, editors, CLEF 2015 Labs and Workshops, Notebook Papers, CEUR Workshop Proceedings. Toulouse, France.

   Francisco Rangel and Paolo Rosso. 2016. On the impact of emotions on author profiling. Information Processing & Management, 52(1):73–92.

   Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. 2013. Overview of the author profiling task at PAN 2013. In Proceedings of PAN at CLEF 2013. URL http://www.uni-weimar.de/medien/webis/research/events/pan-13/pan13-papers-final/pan13-author-profiling/rangel13-overview.pdf.

   Delip Rao, Michael Paul, Clay Fink, David Yarowsky, Timothy Oates, and Glen Coppersmith. 2011. Hierarchical bayesian models for latent attribute detection in social media. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM), pages 598–601. Barcelona.

   Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents (SMUC). Toronto.

   Sara Rosenthal and Kathleen McKeown. 2011. Age prediction in blogs: A study of style, content, and online behavior in pre- and post-social media generations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 763–772. Association for Computational Linguistics, Portland, Oregon, USA. URL http://www.aclweb.org/anthology/P11-1077.

   Derek Ruths and Jürgen Pfeffer. 2014. Social media for large studies of behavior. Science, 346(6213):1063–1064.

   Maarten Sap, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, and Hansen Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1146–1151. Association for Computational Linguistics, Doha, Qatar. URL http://www.aclweb.org/anthology/D14-1121.

   Upendra Sapkota, Steven Bethard, Manuel Montes, and Thamar Solorio. 2015. Not all character n-grams are created equal: A study in authorship attribution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–102. Denver, CO.

   Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 78–86. Portland, OR. URL http://www.aclweb.org/anthology/W11-0310.

   J. Schler, Moshe Koppel, S. Argamon, and J. Pennebaker. 2006. Effects of age and gender on blogging. In Proceedings of the AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs. URL http://u.cs.biu.ac.il/~koppel/papers/springsymp-blogs-07.10.05-final.pdf.

   H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Stephanie M. Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin E. P. Seligman, and Lyle H. Ungar. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE, 8(9). E73791. doi:10.1371/journal.pone.0073791.

   Parag Singla and Matthew Richardson. 2008. Yes, there is a correlation - from social networks to personal behavior on the web. In Proceedings of WWW, Refereed Track: Social Networks & Web 2.0 - Analysis of Social Networks & Online Interaction, pages 655–664. Beijing.

   Luke Sloan, Jeffrey Morgan, Pete Burnap, and Matthew Williams. 2015. Who tweets? deriving the demographic characteristics of age, occupation and social class from twitter user meta-data. PLoS ONE, 10(3). E0115545. doi: 10.1371/journal.pone.0115545.

   Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology (JASIST), 60(3):538–556.

   Efstathios Stamatatos, Walter Daelemans, Ben Verhoeven, Patrick Juola, Aurelio López-López, Martin Potthast, and Benno Stein. 2015. Overview of the Author Identification Task at PAN 2015. In Linda Cappellato, Nicola Ferro, Gareth Jones, and Eric San Juan, editors, CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers, 8-11 September, Toulouse, France. CEUR-WS.org.

   Yla R. Tausczik and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29:24–54.

   Joel Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 48–57. Atlanta, GA. URL http://www.aclweb.org/anthology/W13-1706.

   Joel Tetreault, Daniel Blanchard, Aoife Cahill, and Martin Chodorow. 2012. Native tongues, lost and found: Resources and empirical evaluations in native language identification. In Proceedings of COLING 2012, pages 2585–2602. Mumbai. URL http://www.aclweb.org/anthology/C12-1158.

   Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2010. Predicting elections with twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM), pages 178–185. Washington, DC.

   Benjamin Van Durme. 2012. Streaming analysis of discourse participants. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 48–58. Association for Computational Linguistics, Jeju Island, Korea. URL http://www.aclweb.org/anthology/D12-1005.

   Svitlana Volkova. 2015. Predicting Demographics and Affect in Social Networks. Ph.D. thesis, Johns Hopkins University, Baltimore, MD.

   Svitlana Volkova, Glen Coppersmith, and Benjamin Van Durme. 2014. Inferring user political preferences from streaming communications. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 186–196. Baltimore, Maryland. URL http://www.aclweb.org/anthology/P14-1018.

   Svitlana Volkova, Theresa Wilson, and David Yarowsky. 2013. Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1815–1827. Association for Computational Linguistics, Seattle, Washington, USA. URL http://www.aclweb.org/anthology/D13-1187.

   Felix Ming Fai Wong, Chee Wei Tan, Soumya Sen, and Mung Chiang. 2013. Quantifying political leaning from tweets and retweets. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), pages 640–649. Ann Arbor, MI.

   Sze-Meng Jojo Wong and Mark Dras. 2011. Exploiting parse structures for native language identification. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1600–1610. Edinburgh. URL http://www.aclweb.org/anthology/D11-1148.

   Shuang-Hong Yang, Bo Long, Alex Smola, Narayanan Sadagopan, Zhaohui Zheng, and Hongyuan Zha. 2011. Like like alike: Joint friendship and interest propagation in social networks. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 537–546. ACM, New York.

   Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 61–72. Association for Computational Linguistics, Seattle, Washington, USA. URL http://www.aclweb.org/anthology/D13-1007.

   Tae Yano and Noah A. Smith. 2010. What’s worthy of comment? content and comment volume in political blogs. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM), pages 359–362. Washington, DC.

   Tae Yano, Dani Yogatama, and Noah A. Smith. 2013. A penny for your tweets: Campaign contributions and capitol hill microblogs. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), pages 737–740. Ann Arbor, MI.

   Faiyaz Al Zamal, Wendy Liu, and Derek Ruths. 2012. Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. In Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM), pages 387–390. Dublin.

   Cathy Zhang and Pengyi Zhang. 2010. Predicting gender from blog posts. Technical report, University of Massachusetts, Amherst.