Documentation of the Conversion Pipeline

Version 1.2+: Using the generic PEPPER conversion path from MS Excel to ANNIS

As of v1.2, the Old German Reference Corpus is annotated in spreadsheet format. The software currently in use is MS Excel 2016 and 2019. The .xlsx file format allows full use of the generic conversion tool PEPPER without the need for the additional, project-specific conversion algorithms that were previously maintained exclusively for converting the reference corpus into ANNIS (see the section “Version 0.1 to 1.1” below).

PEPPER is part of the ANNIS world and is maintained by the developers of ANNIS. Information about PEPPER and documentation of the generic conversion tool and process can be found on the following websites:

Pepper Homepage
Pepper Guide


Version 0.1 to 1.1: From Elan to Annis

Documentation for the conversion path of Elan files to their representation in ANNIS

This document gives practical guidelines to follow when moving the Elan files of the DDD-AD project, across some hurdles, to the corpus search program ANNIS.



“A black cat crossing your path signifies that the animal is going somewhere.”

(Groucho Marx)

1. Data management

In this section, the file structure is discussed in detail, including best practice guidelines for dealing with in-between stages and introducing new changes.

1.1 Storage of Elan files in the Repositorium

On the server storage disk P:\ of the IdSL, the Elan files of the DDD-AD project are stored under the path P:\SFB632AD\Repositorium. This directory is the so-called repositorium, in which all Elan files, ANNIS files and further tools related to the project are stored.

The file structure is as follows.

At the top level of the “Repositorium” directory, there are some individual files that are used during conversion or are ANNIS configuration files.

  • “metadata.txt” contains the metadata for the individual documents in a tab-delimited format
  • “AS-mapping.txt” contains a concordance that allows a normalization of Old Saxon lemmata to the Tiefenbach lemmata
  • “ddd.json” is an ANNIS configuration file that is used to set up an ANNIS instance with corpus sets
  • “DDD from Elan to Annis.docx” is this document

At the top level of the “Repositorium” directory, there is a directory with the name “_tools”. It contains some tools that are needed for preparing the Elan files for conversion.

At the top level of the “Repositorium” directory, there are also directories with names that represent the subcorpora. These subcorpora are usually individual texts, such as “Otfrid” or “Tatian”, but can be a collection of texts, such as the “Kleinere Althochdeutsche Denkmäler”. There is one directory “_obsolete” where faulty or legacy directories can be stored.

Within each subcorpus directory, multiple directories represent the different stages of completion (cf. Section 1.2). There are also some other directories:

  • “meta” contains the metadata for the subcorpus, derived from the metadata.txt file at the top of the Repositorium
  • “relannis” contains the relAnnis files from a previous conversion round
  • “relannis_bkp” contains the backup relAnnis files from the previous conversion round

Apart from these directories, there are also some files. These files are used during conversion and their use will be explained below.

  • “addorder.prop”
  • “bearbeitung2public.properties”
  • “elan2relannis.pepperParams”
  • “finishRelannis.py”
  • “special.params”
  • “log.txt”

1.2 Saving in-between stages

It is important to keep track of in-between stages of completion so that, at a later point in time, certain decisions can be verified, or so that there is a recent backup in case something goes wrong during an update round.

The idea is that with every update round, the number in front of the directory name is incremented by one. The first version is always the “0_vorannotation” directory, which contains the raw Elan files that were created in Frankfurt. From then on, there typically is a “1_student” directory, which contains the first student corrections. The “2_korrektur” directory then contains the checks of the student corrections. In the “3_excel” directory, Excel lists of the Elan files are stored, in which consistency errors are tracked and marked; the “4_excel-corrections” directory contains the Elan files after the inconsistencies (discovered with the Excel files) have been ironed out.

It is quite possible that there are some more directories after that. This is normal, as further mistakes need to be taken care of. With every update round, a new directory, with an incremented leading number, needs to be created. That directory will then contain all Elan files in their updated form. It is important that every directory contains all files!

1.3 Introducing new changes

So, what is the step-by-step procedure if one wants to make a correction? Let us start from a situation where the current stage is represented by the directory “8_morphology”, and several smaller errors have been discovered during corpus research whose corrections now need to be introduced in the files.

  1. Create a directory with a name that starts with 9, e.g. “9_corrections-summer-14”, so as to increment the version from 8 to 9.
  2. Copy all files in the “8_morphology” directory to the “9_corrections-summer-14” directory.
  3. Perform the changes on the files in the “9_corrections-summer-14” directory (see the sketch below).
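
The first two steps can be scripted. The following is a minimal sketch in Python; the subcorpus path is hypothetical and needs to be adapted.

import shutil
from pathlib import Path

# hypothetical subcorpus path; adapt to your setup
corpus = Path("P:/SFB632AD/Repositorium/Otfrid")
current = corpus / "8_morphology"
new = corpus / "9_corrections-summer-14"

# copytree refuses to overwrite an existing directory, which protects
# earlier stages from being clobbered by accident
shutil.copytree(current, new)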

2. Tools for improving the annotations

There are two tools for improving the annotations in a (set of) Elan file(s). The first tool (2.1) provides lists of annotations so that consistency can be checked; the second tool (2.2) automatically performs corrections in the Elan files. Both tools are packaged together as a single Java program, named “dddtools.jar”, which can be found in the “_tools” directory of the “Repositorium” directory.

2.1 Excel lists

For the creation of Excel lists, which transform the Elan files into Excel spreadsheets, a tool from “dddtools.jar” can be called with the flag “excel”. This program will turn every individual Elan file in a specific directory into a CSV file. The program takes one argument that refers to the folder in which the Elan files reside that need to be converted to CSV files (mind the slash at the end of the path).

java -jar dddtools.jar excel /path/to/directory/with/elan/files/

The CSV files are written into the same folder as the path that was given. The correct procedure is thus:

  1. Create a new folder with an incremented status number, e.g. “3_excel”.
  2. Within this folder, create a folder “file-by-file” into which all the Elan files must be copied.
  3. Run the excel tool (with the path to the “file-by-file” directory), which will create a CSV file for each file in the “file-by-file” directory (see the sketch below).
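
These steps can also be scripted. A minimal sketch in Python, with hypothetical paths (Elan files carry the .eaf extension):

import shutil
import subprocess
from pathlib import Path

corpus = Path("P:/SFB632AD/Repositorium/BenediktinerRegel")  # hypothetical
source = corpus / "2_korrektur"            # latest annotation stage
target = corpus / "3_excel" / "file-by-file"
target.mkdir(parents=True, exist_ok=True)

# copy all Elan files into the "file-by-file" directory
for eaf in source.glob("*.eaf"):
    shutil.copy(eaf, target)

# run the excel tool; mind the slash at the end of the path
subprocess.run(["java", "-jar", "dddtools.jar", "excel", str(target) + "/"],
               check=True)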

These separate CSV files can be grouped into several larger files, each of which then contains a number of CSV files. This is necessary to allow the annotators to investigate consistency across files. The grouping of the files is done by a small Python script, “excelcombine.py”. Within this Python script, one needs to change the path to the corpus (something like “Repositorium/BenediktinerRegel/”) and the path to the CSV files within that corpus (something like “3_excel/file-by-file/*.csv”):

# path to corpus
path = "/path/to/corpus/"
# path to files
fl = glob.glob(path + "3_excel/file-by-file/*.csv")

The script will then create combined CSV files in the corpus directory.

python excelcombine.py

As a service to the annotators, you can import these CSV files into Excel and turn them into regular .xlsx files, so that they do not need to deal with the text import functionality of Excel. Note that this script creates UTF-8 files.
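
The combining step itself is straightforward. The following standalone sketch shows the idea; it is not the actual excelcombine.py, and the group size is an assumption:

import glob

path = "/path/to/corpus/"
files = sorted(glob.glob(path + "3_excel/file-by-file/*.csv"))

GROUP = 10  # number of CSV files per combined file; an assumption
for i in range(0, len(files), GROUP):
    outname = path + "combined_%02d.csv" % (i // GROUP + 1)
    with open(outname, "w", encoding="utf-8") as out:
        for name in files[i:i + GROUP]:
            with open(name, encoding="utf-8") as f:
                out.write(f.read())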

2.2 Automatic error replacement

The automatic replacement of systematic errors is performed during the preparation of derived Elan files that are not to be used for further annotation, but only serve as the basis for the conversion towards ANNIS. This procedure will be discussed in Section 3 and makes use of the prepare tool of the “dddtools.jar” program. Here, however, we only spend some time on how to provide the parameters for this automatic error replacement.

The “dddtools.jar” program contains a tool that works off the “bearbeitung2public.properties” file and produces not only the Elan files that are going to be used for conversion, but also a “log.txt” file that lists the automatic changes that have taken place.

java -jar dddtools.jar prepare /path/to/bearbeitung2public.properties

There are quite a few possible settings in “bearbeitung2public.properties”, but here we focus on the search-and-replace possibilities.

The general format of these search-and-replace definitions is as follows: find on level level1 all annotation values that match value1, and replace within them the (sub)string value2 with value3, on the condition that to the left of, to the right of, or aligned with the match (left/right/align) on level level2 the value value4 can be found. The conditions can be left out entirely, and there can be as many conditions as you want. Regular expressions are supported. (A runnable sketch of these semantics is given at the end of this section.)

As an example, take the easiest version, where there are no conditions:

S1a Satz, .+_Att, Att, Rel

This is parsed as follows:

  • Level1 = S1a Satz
  • Value1 = .+_Att
  • Value2 = Att
  • Value3 = Rel

And performs the following steps:

Go through level “S1a Satz” (level1) and find all annotation values that match “.+_Att” (value1). In all these values, replace the substring “Att” (value2) with “Rel” (value3).

Of course, it is possible to add conditions. Take for example:

M1a DDDTS Lemma, PI, PI, DI, align, Lemma, iogilīh

This is parsed as follows:

  • Level1 = M1a DDDTS Lemma
  • Value1 = PI
  • Value2 = PI
  • Value3 = DI
  • Condition = [align, Lemma, iogilīh]

And performs the following steps:

Go through level “M1a DDDTS Lemma” and find all annotation values that are “PI”. In these values, replace “PI” with “DI”, under the condition that the aligned annotation on level “Lemma” equals “iogilīh”.

Every single change that was automatically made is documented in detail in the “log.txt” file.
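
To make the semantics of these definitions concrete, here is a small self-contained sketch. It operates on a simplified in-memory representation (a dict mapping tier names to lists of values, where equal indices count as “aligned”) rather than on real Elan files, and the parsing is reduced to the essentials:

import re

def apply_rule(rule, tiers, log):
    parts = [p.strip() for p in rule.split(",")]
    level1, value1, value2, value3 = parts[:4]
    # remaining fields are condition triples: (left/right/align, level2, value4)
    conds = [parts[i:i + 3] for i in range(4, len(parts), 3)]
    for i, val in enumerate(tiers[level1]):
        if not re.fullmatch(value1, val):        # value1 may be a regex
            continue
        ok = True
        for pos, level2, value4 in conds:
            j = {"left": i - 1, "right": i + 1, "align": i}[pos]
            if not (0 <= j < len(tiers[level2])) or tiers[level2][j] != value4:
                ok = False
        if ok:
            new = val.replace(value2, value3)
            log.append("%s[%d]: %s -> %s" % (level1, i, val, new))
            tiers[level1][i] = new

tiers = {"M1a DDDTS Lemma": ["PI", "PI"], "Lemma": ["iogilīh", "ander"]}
log = []
apply_rule("M1a DDDTS Lemma, PI, PI, DI, align, Lemma, iogilīh", tiers, log)
print(tiers["M1a DDDTS Lemma"])   # ['DI', 'PI'] -- only the aligned match changed
print(log)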

3. Preparing Elan files for conversion

3.1 Metadata

At the top of the “Repositorium” directory, there is a “metadata.txt” file that contains the meta information for every single document in the corpus in a tab-delimited format. The preparation tool uses this file to generate a text file in the corpus directory (within the directory “meta”). This text file is later used during the conversion of the Elan files. Before performing the Elan to relAnnis conversion with Salt ‘n Pepper, it is advised to verify that text file carefully.
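
How the preparation tool derives the per-corpus file is internal to dddtools.jar; the following sketch merely illustrates the idea, under the assumption that the first tab-delimited column of “metadata.txt” carries the document name. The output file name is hypothetical as well:

corpus = "Otfrid"  # hypothetical subcorpus name

with open("metadata.txt", encoding="utf-8") as src, \
        open(corpus + "/meta/meta.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # keep only the rows whose document name belongs to this subcorpus
        if line.split("\t")[0].startswith(corpus):
            dst.write(line)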

3.2 dddtools.jar prepare

The prepare tool of dddtools.jar is run in a terminal of your choice by means of the following command (on one line):

java -jar dddtools.jar prepare /path/to/bearbeitung2public.properties

The program goes through all files that are specified in the “bearbeitung2public.properties” file with the “source” property (mind the slash at the end). Several improvements are made to the files, including automatic error replacement (Section 2.2), renaming and removing levels, fixing time slots, etc. Verify the paths to the directories at the bottom of the file, since these paths are likely to change with every conversion round, i.e. they should point to the latest stage of the annotation updates.

3.3 Result of preparation step

At the end of the preparation step, a folder called “DDD-<subcorpusname>” is created in which the prepared files reside. These files are only intended for conversion to relAnnis and should not be used for further annotation!

4. Conversion with Salt ‘n Pepper

Salt ‘n Pepper is the conversion framework with which the Elan files are converted to the ANNIS database format. You can obtain Salt ‘n Pepper from here:

https://korpling.german.hu-berlin.de/p/projects/saltnpepper/wiki/

The running of Salt ‘n Pepper is explained in the wiki and, in much less detail, below.

4.1 The configuration files

“elan2relannis.pepperParams” is the core file for the conversion. It makes explicit the different steps that need to be followed during conversion. In our case, there are three steps.

The corpus is first imported by means of the ElanImporter. The Elan files are found in the folder “./DDD-<subcorpusname>”, and there is a file “./special.params” that further modifies the importer.

<importerParams moduleName="ElanImporter" sourcePath="./DDD-<subcorpusname>" specialParams="./special.params"/>

Then, segmentation layers are added with the “OrderRelationAdder” module. The module is modified by a special parameters configuration file “./addorder.prop”.

<moduleParams moduleName="OrderRelationAdder" specialParams="addorder.prop"/>

And finally, the corpus is exported to the relAnnis format, version 3.2, and the relAnnis files will be saved in the “./relannis” directory.

<exporterParams formatName="relANNIS" formatVersion="3.2" destinationPath="./relannis"/>

These three steps are at the center of the conversion. As you can see, the import of the Elan files relies on a further configuration file, “special.params”.

The “special.params” file sets the following information: which annotation layer needs to be used as the source for the primary data. Typically, this is the annotation layer with the most annotations.

elan.importer.primTextTierName=character

It is also necessary to inform the module about those annotation layers that are going to be used as segmentation layers.

elan.importer.segTierNames=ling,edition

Perhaps you want to ignore some annotation layers in your corpus?

elan.importer.ignoreTierNames=segm,comp,character

In case there is a related file that contains a parallel text (such as glosses), you can specify the directory in which these related files are stored.

elan.importer.linkedFolder=

Finally, the simple file “addorder.prop” tells the conversion tool which layers need to be segmentation layers in ANNIS.

segmentation-layers={tok, ling, edition}
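
Since the paths in these configuration files tend to change with every conversion round (cf. Section 3.2), a quick sanity check before starting Pepper can save a failed run. A small optional sketch; the subcorpus path is hypothetical, the file names follow this section:

from pathlib import Path

subcorpus = Path("/path/to/subcorpus")  # hypothetical
for name in ["elan2relannis.pepperParams", "special.params", "addorder.prop"]:
    assert (subcorpus / name).is_file(), "missing configuration file: " + name

# the prepared Elan files from Section 3.3 must be in place as well
assert any(subcorpus.glob("DDD-*")), "no DDD-<subcorpusname> directory found"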

4.2 Running Salt ‘n Pepper

In a Linux terminal of your choice you can run Salt ‘n Pepper with the following command:

bash pepperstart.sh -w /path/to/elan2relannis.pepperParams

5. Cleaning up the Salt ‘n Pepper output

Salt ‘n Pepper obviously produces valid output, and it is perfectly possible to import the relAnnis files that come out of the conversion step. However, we perform a couple of final brush-ups to make the output more attractive.

5.1 Backing up the initial output

Before we perform these brush-up steps, we back up the original output of the Salt ‘n Pepper conversion in the directory “relannis_bkp”. By doing so, we always have a vanilla Salt ‘n Pepper output available that we can reuse in case our brush-up steps break the relAnnis structure.

So, simply copy the relAnnis files from the “relannis” directory into the “relannis_bkp” directory, replacing the files already there.

5.2 Run finishRelannis.py

All brushing up steps are performed automatically by means of the finishRelannis.py script. In a terminal of your choice, run

python finishRelannis.py

and the script will look for the “relannis” directory and perform the clean-up.

In the background, a couple of things happen.

  • Creation of the HTML visualization
  • Modification of the visualizer configuration
  • Setting of default parameters for ANNIS
  • Creation of example queries
  • Removal of “default_ns” messages
  • Translation of meta levels to English
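
As an illustration of one of these steps, removing the “default_ns” entries could look roughly like the following. This is not the actual finishRelannis.py code, and what the real script substitutes for “default_ns” (an empty field, another namespace) should be checked there:

from pathlib import Path

tab = Path("relannis/node_annotation.tab")
text = tab.read_text(encoding="utf-8")
# replace the namespace with an empty field; an assumption, see above
tab.write_text(text.replace("default_ns", ""), encoding="utf-8")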

6. Import into Annis

6.1 Trying out in local Annis Kickstarter

Before you bring the final product of the conversion to the corpus linguists for import into the online ANNIS instance on their server, it is important to try out the import in a local ANNIS instance, called the Kickstarter. Install the Kickstarter on your system and try the import. If it works, open the Kickstarter and ANNIS in a browser to verify the integrity of the corpus. Make a couple of searches, and open the annotation grid and the edition reading panel. Once you turn to the corpus linguists, nothing can go wrong.

6.2 Upload to korpling server

After carefully testing the corpus locally, you can go to Thomas Krause (3.333) to have the corpus imported into the online version of ANNIS. If you did your testing correctly, everything should go smoothly. It is no problem if something goes wrong once or twice, but try to keep mistakes to a minimum.

7. Dealing with error messages

7.1 dddtools.jar prepare

  • IOExceptions during the start of the program probably indicate that the “bearbeitung2public.properties” file contains an incorrect file reference.
  • “Error: <tier>, <begin time>, <end time>, <value>” indicates that there is something wrong in the input Elan file on the given tier between the given begin and end time.

7.2 Salt ‘n Pepper

  • “Something wrong at <begin time> to <end time>”: as the message says, the Elan file contains a structure between the given begin and end time that is not convertible by Salt ‘n Pepper. Return to the original file and try to find out what might be wrong.

7.3 Import in Annis

  • “Error in node_annotation”: usually, this error is followed by a line number. At that line of the relAnnis file node_annotation.tab, there usually is a stray newline or tab. Simply remove the newline or tab (see the sketch below).
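
A quick way to locate such a line is to compare the tab count of every row of node_annotation.tab against the first row; a deviating count usually pinpoints the stray newline or tab. A small sketch, assuming the file sits in the “relannis” directory:

with open("relannis/node_annotation.tab", encoding="utf-8") as f:
    rows = f.readlines()

expected = rows[0].count("\t")
for num, row in enumerate(rows, start=1):
    if row.count("\t") != expected:
        print("line %d has %d tabs (expected %d)"
              % (num, row.count("\t"), expected))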

8. Short guide for conversion (Linux)

cd /path/to/dddtools/
java -jar dddtools.jar prepare /path/to/bearbeitung2public.properties
cd /path/to/Saltnpepper
bash pepperstart.sh -w /path/to/elan2relannis.pepperParams
cd /path/to/subcorpus
cp -r relannis relannis_bkp
python finishRelannis.py

Tools

Reading in the information

  1. Digitization of the text dictionaries
    The text dictionaries [except Splett] were digitized and stored in XML format.

  2. Mapping of text passages to dictionary entries
    During the preparation of the individual texts, the occurrences of the corresponding attestations in the dictionaries are looked up for each word, and the matching lemma (or lemmata, if several are possible) as well as grammatical information on lemma and attestation are assigned accordingly.

  3. Enrichment of the grammatical information
    The grammatical information from the dictionaries is subsequently supplemented with further information from the standard grammars.

  4. Transfer of the obtained data into ELAN format
    The data are then transferred into the format of the software ELAN so that they can be edited further by hand.

  5. Determination of standard word forms
    As soon as the first round of manual editing is completed and the lemmata and translations have been adapted to Splett's Althochdeutsches Wörterbuch, the word forms in the standard of the respective language stage can be determined with the help of the information from the standard grammars. (not yet implemented)

  6. Handover of the data
    After the manual editing is completed, the texts are converted into the format of the software EXMARaLDA so that they can then be read in by the ANNIS database.

Editing tools

  • ELAN: http://www.lat-mpi.eu/tools/elan/
  • EXMARaLDA: http://www.exmaralda.org/
  • Perl: http://www.perl.org/

Dictionaries

  • Heffner, Roe-Merill S. (1961): A Word-Index to the Texts of Steinmeyer. Die kleineren althochdeutschen Sprachdenkmäler. Madison: The University of Wisconsin Press.
  • Hench, George Allison (1890): The Monsee Fragments. Straßburg: Trübner.
  • Hench, George Allison (1893): Der althochdeutsche Isidor. Straßburg: Trübner.
  • Kelle, Johann (1881): Glossar der Sprache Otfrids. Regensburg: Manz.
  • Sehrt, Edward (1955): Notker-Wortschatz. Halle: Niemeyer.
  • Sehrt, Edward (1966): Vollständiges Wörterbuch zum Heliand und zur altsächsischen Genesis. Göttingen: Vandenhoeck & Ruprecht.
  • Sievers, Eduard (1892): Tatian. Lateinisch und althochdeutsch mit ausführlichem Glossar. 2. Auflage. Paderborn: Schöningh.
  • Splett, Jochen (1993): Althochdeutsches Wörterbuch. Berlin: de Gruyter.

Grammars

  • Braune, Wilhelm (2004): Althochdeutsche Grammatik. 15. Auflage. Band I: Laut- und Formenlehre. Bearbeitet von Ingo Reifenstein. Tübingen: Niemeyer.
  • Gallée, Johan Hendrik (1993): Altsächsische Grammatik. 3. Auflage mit Berichtigungen und Literaturnachträgen von Heinrich Tiefenbach. Tübingen: Niemeyer.