Old German Reference Corpus
The DFG-funded project “Old German Reference Corpus” (Referenzkorpus Altdeutsch) aims to provide a linguistically annotated digital database of the entire text corpus (ca. 650,000 words) from the earliest stages of written German (Old High German and Old Saxon, around 750–1050 AD), based on the most reliable text editions.
The annotation comprises header information, structural annotation (word, sentence, line, paragraph, etc.), linguistic annotation (part-of-speech tagging, inflection), and syntactic information. It is produced with the help of a semi-automatic pre-annotation generated from a range of digitized dictionaries and glossaries of Old High German and Old Saxon. The different annotation levels are linked to one another by means of a multi-level architecture.
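To make the idea of linked annotation levels concrete, the following is a minimal Python sketch, not the project's actual data model. It assumes that all levels refer to a shared token sequence: token-level layers (such as part of speech) hold one value per token, and structural layers (such as sentence or line) are spans over token indices. The class and method names (`Document`, `Span`, `annotations_at`) are hypothetical, and the part-of-speech values are illustrative STTS-style tags, not the corpus's own annotation.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A structural unit covering tokens[start:end] (end is exclusive)."""
    start: int
    end: int
    label: str


@dataclass
class Document:
    """Hypothetical multi-level container: every layer points at the same tokens."""
    tokens: list[str]
    token_layers: dict[str, list[str]] = field(default_factory=dict)
    span_layers: dict[str, list[Span]] = field(default_factory=dict)

    def add_token_layer(self, name: str, values: list[str]) -> None:
        # A token-level layer must supply exactly one value per token.
        if len(values) != len(self.tokens):
            raise ValueError(f"layer {name!r} needs {len(self.tokens)} values")
        self.token_layers[name] = list(values)

    def add_span(self, layer: str, start: int, end: int, label: str) -> None:
        # Spans must lie inside the token sequence and be non-empty.
        if not 0 <= start < end <= len(self.tokens):
            raise ValueError(f"invalid span {start}-{end}")
        self.span_layers.setdefault(layer, []).append(Span(start, end, label))

    def annotations_at(self, index: int) -> dict[str, str]:
        """Collect the value of every layer that covers the token at `index`."""
        result = {name: values[index] for name, values in self.token_layers.items()}
        for layer, spans in self.span_layers.items():
            for span in spans:
                if span.start <= index < span.end:
                    result[layer] = span.label
        return result


# Illustrative data only: opening words of an Old High German Lord's Prayer,
# with assumed tags and an assumed line division.
doc = Document(tokens=["Fater", "unseer", "thu", "pist", "in", "himile"])
doc.add_token_layer("pos", ["NN", "PPOSAT", "PPER", "VAFIN", "APPR", "NN"])
doc.add_span("sentence", 0, 6, "s1")
doc.add_span("line", 0, 3, "l1")
doc.add_span("line", 3, 6, "l2")

print(doc.annotations_at(3))
```

Because every layer is anchored to token indices rather than to other layers, a new level (for example syntactic information) can be added in this sketch without touching the existing ones.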
Because part of the Old German record depends to a greater or lesser degree on Latin source texts, the parallel Latin texts are included in the database as well. They are annotated in the same way as the Germanic corpus and aligned with the German texts, so that interference from Latin in the German text can be identified and analyzed.
The project is carried out in Berlin, Frankfurt, and Jena under the direction of Prof. Dr. Karin Donhauser (Humboldt-University of Berlin), Prof. Dr. Jost Gippert (University of Frankfurt am Main), and Prof. Dr. Rosemarie Lühr (University of Jena).
The Reference Corpus lays the foundation for a larger, multi-part historical corpus of German. Cooperation with the corresponding projects for Middle High German and Early Modern German at the universities of Bonn and Bochum ensures that the different annotation systems are compatible, so that the corpora can later be merged into a single database.
Because the annotations can be mapped to the STTS (Stuttgart-Tübingen Tagset), these historical corpora can also be searched with the queries commonly used for modern-language corpora. The annotated texts are included in the ANNIS database.
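The mapping to STTS can be pictured as a simple lookup from a finer-grained historical tag to a standard tag. The Python sketch below is an illustration under stated assumptions, not the project's tagset or conversion procedure: the fine-grained tag names on the left of the table are invented for this example, while the values on the right (NN, NE, VVFIN, VAFIN, PPER, APPR) are genuine STTS tags. The function names `to_stts` and `find_by_stts` are likewise hypothetical.

```python
# Hypothetical fine-grained tags (invented for illustration) mapped to real STTS tags.
FINE_TO_STTS = {
    "NOUN.common": "NN",
    "NOUN.proper": "NE",
    "VERB.full.finite": "VVFIN",
    "VERB.aux.finite": "VAFIN",
    "PRON.personal": "PPER",
    "PREP": "APPR",
}


def to_stts(fine_tag: str, default: str = "UNMAPPED") -> str:
    """Return the STTS tag for a fine-grained tag, or `default` if none is defined."""
    return FINE_TO_STTS.get(fine_tag, default)


def find_by_stts(annotated: list[tuple[str, str]], stts_tag: str) -> list[str]:
    """Return all word forms whose fine-grained tag maps to the given STTS tag."""
    return [form for form, fine_tag in annotated if to_stts(fine_tag) == stts_tag]


# Illustrative tokens with assumed fine-grained tags.
annotated = [
    ("thu", "PRON.personal"),
    ("pist", "VERB.aux.finite"),
    ("in", "PREP"),
    ("himile", "NOUN.common"),
]

# A query phrased in STTS terms, as one would write it for a modern corpus.
print(find_by_stts(annotated, "NN"))
```

In this sketch the historical annotation stays untouched and the STTS view is derived on demand, so a query written against the standard tagset works without discarding the finer distinctions.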