Tools for annotating and searching corpora
Corpus linguistics aims at investigating linguistic research questions by means of digital corpora. In the simplest case, corpora provide illustrative examples for a certain phenomenon. In more interesting cases, corpora can be used for examining properties of a phenomenon, and for generating or validating linguistic hypotheses and generalizations.
An important step in corpus creation is corpus annotation. Annotating a corpus means enriching it with linguistic information, such as parts of speech (e.g. nouns) or syntactic categories (e.g. NPs). Annotations thus allow us to search for complex phenomena, such as constituent order in main vs. subordinate clauses.
This course is about creating annotations as well as using them in corpus searches. While there are many tools for manual annotation, the course focuses on tools that support (semi-)automatic annotation. Furthermore, tools for searching and visualizing linguistic data are addressed.
In the afternoon sessions, selected aspects of the theoretical sessions are put into practice. Among other things, we annotate some data and measure agreement between multiple annotators, apply automatic tools and evaluate their performance, and search different corpora for specific phenomena.
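Measuring agreement between two annotators can be sketched as follows. The example computes Cohen's kappa, one common chance-corrected agreement measure; the course description does not say which measure is used, so this choice is an assumption, as are the function name and the toy part-of-speech labels.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical part-of-speech labels for eight tokens (illustration only).
annotator_1 = ["NOUN", "VERB", "DET", "NOUN", "ADJ", "NOUN", "VERB", "DET"]
annotator_2 = ["NOUN", "VERB", "DET", "ADJ", "ADJ", "NOUN", "NOUN", "DET"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

The two annotators agree on six of eight tokens (0.75), but kappa is lower because some of that agreement would be expected by chance given how often each label is used.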
Plan
Monday: General introduction to corpus annotation
Types of corpora
Types of annotation
Available corpora
Tuesday: Introduction to automatic tools
Basic principles: training, testing, smoothing, evaluation
Wednesday: Application of automatic tools
Tools: tokenizers, part-of-speech taggers, parsers, and others
Query languages
Qualitative and quantitative analyses
Learner corpora and Second Language Acquisition research
Learner corpora are collections of texts (written or spoken) produced by learners of a language (here: learners of a foreign language). Learner corpus data is one type of evidence for the study of acquisition processes. In this course, we discuss the collection, annotation and analysis of learner corpus data. The two main ways of analyzing learner corpus data are error analysis and comparative analysis. In error analysis, a problematic learner utterance is compared to a (supposedly) correct utterance and the difference is categorized. In (1) (from the Falko corpus, a corpus of essays written by advanced learners of German) there are four errors: one lexical (wrong word), one orthographic, and two grammatical (agreement).
"when one has received a driver's license"
By analyzing errors, we can develop and test hypotheses about the learner’s interlanguage. We can compare errors between native speakers and learners or between different learner populations, for example, learners with different L1s or learners at different stages of acquisition. However, error analysis has many conceptual and methodological problems: a learner utterance must be interpreted before its errors can be categorized, and the same utterance can often be interpreted in different ways. In (1), for example, the indefinite article is possible but not quite idiomatic (den Führerschein machen instead of einen Führerschein machen). Should this be counted as a lexical error (collocation) or a definiteness error? Or not at all?
Comparative analysis builds on the fact that corpus data is usage data: usage patterns can be analyzed and compared (again, between native speakers and learners or between different learner populations). We can then find out which categories (words, parts of speech, syntactic constructions, etc.) are overused or underused by a given learner population. This might help us understand which categories are perceived as difficult or which categories are influenced by a given L1. Again, the corpus data must be interpreted in order to be useful, and this is again difficult. Since grammatical models describe ‘grammatical’ structures, it is often unclear how a learner utterance should be categorized.
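A minimal sketch of an overuse/underuse comparison is given below. It normalizes raw counts by corpus size and reports the binary log of the ratio of relative frequencies, which is one possible effect-size measure and not necessarily the one used in the course. The function names, the corpus sizes and the counts are hypothetical.

```python
import math


def per_million(count, corpus_size):
    """Relative frequency of a category, in occurrences per million tokens."""
    return count / corpus_size * 1_000_000


def log_ratio(count_learner, size_learner, count_reference, size_reference):
    """Binary log of (learner relative frequency / reference relative frequency).

    Positive values point to overuse by the learners, negative values to
    underuse. Zero counts would need smoothing, which this sketch leaves out.
    """
    if count_learner <= 0 or count_reference <= 0:
        raise ValueError("both counts must be positive in this sketch")
    relative_learner = count_learner / size_learner
    relative_reference = count_reference / size_reference
    return math.log2(relative_learner / relative_reference)


# Hypothetical counts of one category in a learner and a native-speaker corpus.
learner_count, learner_size = 120, 100_000
native_count, native_size = 150, 250_000

learner_pm = per_million(learner_count, learner_size)
native_pm = per_million(native_count, native_size)
ratio = log_ratio(learner_count, learner_size, native_count, native_size)
print(learner_pm, native_pm, ratio)
```

In this invented case the learners use the category about twice as often per million tokens as the native speakers, so the log ratio is close to 1. Whether such a difference is more than chance variation, and whether it holds across individual learners, still has to be checked separately.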
In this course, we will talk about the possibilities of using learner corpora as well as about the strategies for interpretation and annotation.
Plan
Monday: General introduction to learner data
Theoretical: Usage data and acquisition models
Methodological: Corpus design
Tuesday: Error analysis
Theoretical: Comparative fallacy
Methodological: Error annotation, target hypotheses
Wednesday: Comparative analysis
Theoretical: Usage frequencies and competence, variation between registers
Methodological: General issues in comparing corpora, overuse/underuse studies, within-group variation vs. between-group differences
Thursday: Variationist studies (mainly exemplified by spoken data)
Methodological: Variationist annotation
In the afternoon sessions, we will annotate English learner corpus data (advanced learners) on several levels. We will focus on the conceptual issues behind annotating data that does not conform to ‘standards’. Depending on participants' interests and prior knowledge, we can work on lexical annotation, syntactic annotation, or phonological annotation.
Statistical methods for corpora (using R)
This course is a primer on statistical analysis of language data with a focus on corpus materials, using the freely available statistics software ‘R’. After a quick overview of foundational notions for statistical evaluation, hypothesis testing and visualization of linguistic data, we will explore methods needed for understanding and doing current quantitative research. We will discuss descriptive, inferential and exploratory statistics, including looking at correlations, significance testing, and an introduction to regression modeling for language data. The course assumes basic mathematical skills and familiarity with linguistic methodology, but does not require a background in statistics or R.
Plan
Monday: Introduction, variables and estimators, basic descriptive statistics
Tuesday: Variance, correlation, visualization basics
Wednesday: Significance, hypothesis testing, t-test
Thursday: Chi-square and ANOVA
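To make the chi-square test on the plan more concrete, here is a small worked sketch of the arithmetic for a 2x2 table of corpus counts. The course itself works in R; Python is used here only to spell out the computation, and the function name and the counts are hypothetical. The sketch computes the plain Pearson statistic without a continuity correction.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    column_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * column_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two speaker groups, columns are
# "construction used" vs. "construction not used".
counts = [[30, 70], [50, 50]]

statistic = chi_square_2x2(counts)
print(round(statistic, 2))
```

With one degree of freedom, the critical value at the 0.05 level is about 3.84, so a statistic of roughly 8.33 for these invented counts would lead us to reject the hypothesis that group and construction use are independent.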
Listeners generalize across utterances from different speakers, dialects, listening conditions, and contexts. I show how a cognitive model that operates over speech corpora can be used to evaluate hypotheses about the dimensions of speech that guide these generalizations. Simulations show that representations that are normalized across speakers predict human discrimination data better than unnormalized representations, mirroring previous findings. The model also reveals differences across normalization methods in how well each predicts human data. These results indicate that cognitive modeling can be used to quantitatively evaluate different representations of speech, yielding consistent and interpretable results.