About The Collection

The Medieval Occitan collection is an ongoing effort to assemble an extensive set of Old Occitan electronic texts. All texts will be annotated and converted into XML formats for wider use. In addition, we aim to provide existing English translations where publishers grant permission.

We believe that a digitized and annotated corpus of Old Occitan will be a valuable resource not only for corpus-linguistic studies but also for a more general audience wishing to become acquainted with Provençal literature.

Corpus Description

The compilation of the corpus consists of the following steps:

  1. OCR correction
  2. Tokenization, lemmatization, tagging, and parsing
  3. Conversion to ANNIS format (see note 1)

The TnT tagger (Brants, 2000) and the Berkeley parser (Petrov et al., 2006) were trained on the MCVF corpus of Old French (Martineau, 2007; see note 2). At present, we have annotated Flamenca and Boece (see the Corpus Search section for tag descriptions).
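
As a rough illustration of the tokenization step, the sketch below splits OCR-corrected lines into tokens and writes them one per line, a layout that statistical taggers commonly take as input. It is not the project's actual tokenizer: the function names `tokenize` and `to_vertical`, the rule that keeps an elided form together with its apostrophe, and the sample lines are all assumptions made for this example.

```python
import re

# Hypothetical token pattern: a run of word characters optionally followed
# by an apostrophe (so an elided form such as "qu'" stays one token), or a
# single punctuation character. Python's \w is Unicode-aware for str input,
# so accented letters are treated as word characters.
TOKEN_RE = re.compile(r"\w+'?|[^\w\s]")


def tokenize(line: str) -> list[str]:
    """Split one line of text into word and punctuation tokens."""
    return TOKEN_RE.findall(line)


def to_vertical(lines: list[str]) -> str:
    """Render lines as one token per line, with a blank line after each
    input line, skipping lines that contain no tokens."""
    blocks = []
    for line in lines:
        tokens = tokenize(line)
        if tokens:
            blocks.append("\n".join(tokens))
    return "\n\n".join(blocks) + "\n" if blocks else ""


if __name__ == "__main__":
    # Illustrative input only, not taken from the annotated texts.
    sample = ["Can vei la lauzeta mover,", "qu'ieu am."]
    print(to_vertical(sample), end="")
```

Keeping the apostrophe attached to the elided form is only one possible convention; a tagger trained on differently segmented data would need the tokenizer to follow that corpus's segmentation instead.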

We welcome comments and reports of any errors or problems that users encounter. Please contact Olga Scrivner (obscrivn AT indiana PERIOD edu).

1. http://www.sfb632.uni-potsdam.de/annis/

2. We would like to thank Professor France Martineau for permission to use the MCVF corpus as a training model for our tagger and parser. (France Martineau. 2010. Corpus MCVF, modéliser le changement: les voies du français)