Annotated Recordings and Texts in the DoBeS Project
Hennie Brugman
Max-Planck-Institute for Psycholinguistics, Nijmegen, Netherlands


1.0 Introduction  

The DoBeS1 (Dokumentation Bedrohter Sprachen – Documentation of Endangered Languages) project was initiated by the Volkswagen-Stiftung2 to support the conservation of endangered languages and cultures.

During its initial two-year pilot phase, starting in September 2000, six documentation teams and one archiving team collectively discussed best practice and tried to formulate initial practical working agreements on a wide range of issues. Procedures were worked out for the digitization and archiving of audio and video recordings and were adapted to individual documentation teams' needs. Agreements were reached on minimal requirements for annotated recordings and texts and on the tag sets to be used. Substantial input was given to the development of the IMDI3 metadata set to make sure that IMDI can properly deal with linguistic field data. Available software tools were evaluated and software development was started to deal with IMDI metadata, linguistic annotations of multimedia data, lexical data, file format conversions and archive access. Legal and ethical issues were discussed at length. Currently 21 documentation teams participate in the DoBeS program and more are expected.

Until now, most software development by the DoBeS archiving team has focused on tools and infrastructure for dealing with metadata for linguistic resources and with linguistic annotations. This paper focuses on the current state of our modeling and development work on linguistic annotations of recordings and texts.


2.0 A model for linguistic annotations

2.1 The Abstract Corpus Model
 

Since the mid-1990s we have been involved in the design and construction of tools for creating, viewing and analyzing linguistic annotations of digital video and audio data (MediaTagger & CAVA4, Eudico5). Because user requirements and the state of technology changed over the years, we went through several revisions of our models (Brugman & Wittenburg, 2001). Currently we use the second revision of an abstract object-oriented model for linguistic annotations that we call the ACM (Abstract Corpus Model).

In discussions during the DoBeS pilot phase, agreement was reached on a minimal set of required annotation tiers, and a maximal proposal (Advanced Glossing, Drude, 2002) was also presented and discussed. DoBeS minimally requires a "rendered text" tier, which can be either an orthographic or a phonetic/phonemic transcription, and a tier that contains a translation into a "major language". Further, DoBeS strongly recommends adding a morpho-syntactic tier for a sufficiently large portion of the corpus for each documented language. Where possible, the "rendered text" tier should be aligned with digitized audio or video recordings.

The Advanced Glossing proposal recommends separate glossing tables for syntax and morphology, and describes 12 different tiers for each, with the intention to provide a well-defined slot for each piece of annotation data.

An analysis of the annotation structures discussed and proposed showed that support for interlinear text and (syntactic) tree structures was required. Other ongoing projects additionally required support for large numbers of independent tiers, for feature clusters and for annotation of non-contiguous ranges (e.g. co-reference).

Figure 1 shows a UML (Unified Modeling Language) class diagram with the concepts from the ACM that are relevant in the context of this paper, and their relations.

Figure 1: an extraction from the Abstract Corpus Model represented as a UML class diagram

An AnnotationDocument contains Annotations of different kinds that all pertain to one or more intervals of time, typically within a video or audio recording. This AnnotationDocument is considered a linguistic resource as defined by IMDI, and can be described by IMDI metadata.

An AnnotationDocument can contain Tiers that in turn can contain Annotations. Our working definition of a Tier is as follows:

A tier is a group of annotations that all describe the same type of phenomenon, that all share the same metadata attribute values and that are all subject to the same constraints on annotation structures, on annotation content and on time alignment characteristics.

Most elements of this definition are dealt with by concepts represented in figure 1. A "group of annotations" is covered by the AnnotationContainer concept (the triangular symbol between Tier and AnnotationContainer means that Tier "is-a" AnnotationContainer). An AnnotationContainer contains Annotations, of which there exist two different types: AlignableAnnotations and ReferenceAnnotations. AlignableAnnotations have two TimeSlots, which represent a begin point and an end point in media time. All TimeSlots in an AnnotationDocument are explicitly ordered in time, but not all TimeSlots have to be filled in with an actual media time (this allows for a mix of time-aligned and non-time-aligned annotations in one document or for time-alignment of pre-existing text documents at a later stage). ReferenceAnnotations refer to one or more other Annotations (either Alignable or ReferenceAnnotations).

"Type of phenomenon" is represented by LinguisticType. A LinguisticType defines the semantics of annotation values on tiers of this type and whether annotations are AlignableAnnotations or ReferenceAnnotations, and it can be associated with Constraints on annotation content, on how annotations can be connected to each other, and on how annotations can be aligned with media time. Examples of constraints on annotation content are that an annotation value should be part of some Controlled Vocabulary, that the value is a URL, or that the value can only use a specific range of Unicode characters.
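A content constraint of the Controlled Vocabulary kind amounts to a simple membership check, as this sketch shows; the vocabulary and function name are invented for the example.

```python
# Hypothetical closed vocabulary for a part-of-speech tier whose
# LinguisticType carries a "controlled vocabulary" content constraint.
POS_VOCABULARY = {"N", "V", "ADJ", "ADV", "PRO", "DET"}

def satisfies_content_constraint(value: str, vocabulary: set) -> bool:
    """True if the annotation value is a member of the closed vocabulary."""
    return value in vocabulary

print(satisfies_content_constraint("V", POS_VOCABULARY))    # True
print(satisfies_content_constraint("xyz", POS_VOCABULARY))  # False
```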

Given the concepts of the ACM and the ways they are associated, it was straightforward to design a format to deal with persistence and implement it using XML. Inspection of the DTD for this format shows the close similarity of the model and the format. The DTD and a short discussion of it are added as an appendix.

Figure 2 shows an example of how instances of these concepts can be combined to form complex linguistic constructs (in this case a block of interlinear text). The figure shows a gray box representing a (video or audio) media signal. Left-to-right represents the media time axis. The TimeOrder box contains circles representing TimeSlots. The black circles are aligned with media time, the gray ones are not. All other boxes represent different tiers. The "utterance" and "word" tiers contain arcs between two TimeSlots with a label on them. These arcs represent AlignableAnnotations with the labels as their values. All other tiers contain text labels with arrows to some other annotation, these represent ReferenceAnnotations. Tiers contain only AlignableAnnotations or only ReferenceAnnotations, depending on the LinguisticType for that Tier.

The arrows on the left of the boxes represent parent-child relationships that can (but need not) exist between Tiers. In this way tiers can be hierarchically organized. These hierarchies between tiers reflect hierarchies between annotations on those tiers, and can be used to suggest parent candidates for annotations that are being newly created.

AlignableAnnotations on the "word" tier are connected to annotations on the "utterance" tier by sharing TimeSlots, thus forming a graph. Unlike pure annotation graphs (Bird & Liberman, 2001), however, the model also has parent-child relationships between tiers, and these uniquely represent hierarchies of annotations.
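The slot sharing can be made concrete with a small sketch: consecutive word annotations reuse the end slot of one word as the begin slot of the next, so no time gaps can occur between siblings. Tier names, times and the utterance text are invented for illustration; slots are modeled as simple (id, time-or-None) tuples.

```python
# Two fully aligned slots for the utterance, two interior slots for
# the word subdivision; interior slots need not be time-aligned yet.
utt_begin, utt_end = ("ts1", 1200), ("ts4", 3400)
mid1 = ("ts2", None)
mid2 = ("ts3", 2600)

utterance = ("so from here.", utt_begin, utt_end)
words = [
    ("so",    utt_begin, mid1),    # first word starts at the parent's begin
    ("from",  mid1,      mid2),    # shares slot ts2 with the previous word
    ("here.", mid2,      utt_end), # last word ends at the parent's end
]

# Because consecutive words share a slot object, the subdivision is gapless:
for (_, _, end), (_, begin, _) in zip(words, words[1:]):
    assert end is begin
```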

Figure 2: combining objects from the annotation model (interlinear text)

2.2 Representation of linguistic constructs with the ACM model
 

As a proof of concept, this section will explore whether and how linguistic constructs, as required by DoBeS and others, can be represented by the ACM. To create such representations we had to define a number of specific constraints on annotation structures and time alignment. One such constraint is "time subdivision", which can be seen as a combination of the constraint "time inclusion" (begin and end times of an annotation should be between the begin and end times of an annotation on a parent tier) and "no time gaps allowed between child annotations of the same parent".

A second stereotypic constraint is needed: "symbolic subdivision". This means that annotations on a tier that has this associated constraint have exactly one parent annotation on a specific parent tier, and that all annotations that refer to the same parent are explicitly ordered.

The stereotypic constraint "symbolic association" means that there is a one-to-one relation between an annotation and its parent annotation.

In figure 2, we now have all the necessary ingredients to represent interlinear text. The "word" tier is either a "time subdivision" or a "symbolic subdivision" of its parent tier, the "utterance" tier. Which choice is made depends on whether individual words will ever be time-aligned. The "morpheme" tier is a "symbolic subdivision". It makes no sense to make it a "time subdivision" since morphemes cannot be located in the annotated signal. However, a best guess about a morpheme's containing time interval can be made by tracing back the annotation hierarchy of which the morpheme is part until the first alignable annotation that is time-aligned. The "part of speech" and "gloss" tiers are "symbolic associations" of the "morpheme" tier.
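The "best guess" lookup just described can be sketched as a walk up the annotation hierarchy; the data structures and names here are illustrative only, with the hierarchy reduced to a parent map and the time alignment to a dictionary of intervals.

```python
def guess_time_interval(annotation, parent_of, times):
    """Follow parent links until a time-aligned ancestor is found;
    return its (begin, end) interval in milliseconds, or None."""
    current = annotation
    while current is not None:
        if current in times:           # first time-aligned ancestor
            return times[current]
        current = parent_of.get(current)
    return None

# "walk-PST" morpheme -> "walked" word -> time-aligned utterance:
parent_of = {"walk-PST": "walked", "walked": "utt1"}
times = {"utt1": (1200, 3400)}
print(guess_time_interval("walk-PST", parent_of, times))  # (1200, 3400)
```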

Setting up a group of tiers in this way, with proper LinguisticTypes and Constraints, fully characterizes interlinear text, even without any annotation existing on any of the tiers.

Figure 3: feature cluster

Figure 3 is an ACM representation of a feature cluster: some time interval that has a number of associated feature values. This annotation setup is quite common, for example, for the annotation of video recordings of gestures, where the number of features for one gesture or phase within a gesture can be substantial.

Feature clusters can be easily represented by making all feature tiers "symbolic associations" of the same root tier.

Figure 4: co-reference

To represent co-reference (figure 4) we need one more structural constraint: "annotations have one or more references to annotations on a specific parent tier". Note that nothing prevents associating a co-reference tier with a word tier that is already part of, for example, an interlinear text.

Figure 5: syntactic tree

To conclude this section, (syntactic) trees with an undetermined number of levels can be represented with the same set of basic ingredients, as shown in figure 5. Again a new structural constraint is necessary, specifying that "an annotation either refers to annotations on some specific parent tier or to annotations on its own tier". Note that all annotation values on the "syntax" tier may also be subject to a constraint on annotation content: a closed vocabulary of syntactic labels.


3.0 Using the model: implementation and tools
 

The ACM is implemented in Java. The development team follows the good practice of defining interfaces and implementing default behavior in abstract classes. This allows tool developers who use ACM classes to program to the interfaces, making it possible to substitute alternative or better implementations. For example, it is quite easy to read some existing annotation file format and use it to instantiate an ACM AnnotationDocument containing Tiers, Annotations, and so on. We have successfully done this for a number of formats (e.g. the Childes6 CHAT format, a relational database with gesture annotation data, Shoebox7 files).

We developed a number of tools on the basis of the ACM, the main ones being Corex8 (Corpus Exploitation software for the Spoken Dutch Corpus) and ELAN9 (Eudico Linguistic Annotator). ELAN is a multi-tier annotation tool for digital video and/or audio. It offers a number of different views on the same underlying annotation data. All of these views are optional, and they are all synchronized with respect to media time, the selected time interval and editing operations. ELAN is publicly available and open source.

To a large extent ELAN already supports some of the complex annotation structures discussed in the previous section (interlinear text, feature clusters), and support for the remaining ones is planned. For example, figures 6 and 7 show interlinear text, internally represented as in figure 2, in two different ways. Figure 6 shows a timeline representation, while figure 7 shows the same annotation data in its interlinear form. The timeline representation is best used for speech transcription and for time alignment of utterances, and possibly of the individual words of which these utterances are made up, while the interlinear viewer is best used to add, delete or modify annotations on the morpheme, part of speech and gloss tiers. Note that both viewers reflect the actual media time, even during media playback (as indicated by the red crosshair and the words highlighted in red, respectively).


Figure 6: ELAN with timeline viewer

Figure 7: ELAN with interlinear viewer

The fact that constraints are explicitly expressed in the ACM allows ELAN to actively enforce them to guide the user. For example, if the user wants to select a time interval to be associated with a new annotation on a "time subdivision" tier, it is made impossible to extend the selection across the boundaries of its potential parent annotation. On a "symbolic association" tier it is impossible to add more than one annotation to the same parent, while this is possible on "symbolic subdivision" tiers.
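The selection behavior on a "time subdivision" tier amounts to clipping the selected interval against the parent annotation's interval, roughly as follows (function name and values are invented for the sketch; this is not ELAN's actual code):

```python
def clamp_selection(sel_begin, sel_end, parent_begin, parent_end):
    """Restrict a selected time interval (in ms) to the bounds of its
    potential parent annotation; return None if nothing remains."""
    begin = max(sel_begin, parent_begin)
    end = min(sel_end, parent_end)
    return (begin, end) if begin < end else None

# A selection spilling over both parent boundaries is clipped to them:
print(clamp_selection(1000, 4000, 1200, 3400))  # (1200, 3400)
# A selection entirely outside the parent yields no valid interval:
print(clamp_selection(100, 900, 1200, 3400))    # None
```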

The interlinear viewer is a good example of what can be done to support more complex linguistic annotation constructs at the user interface level. It is a user interface component that is specialized for a set of tiers that are connected together in a stereotypical way: all tiers involved are part of one tier hierarchy, and all child tiers are either subdivisions or symbolic associations of their parents. A tool can make appropriate specialized user interface components available just by inspecting a document’s tier characteristics.

We foresee such special components for data entry and modification, for special visualization and for specification of queries. An example is a spreadsheet-style viewer and editor for feature clusters where each row represents all annotation values that are associated with the same root annotation and each column is associated with one tier. Such a viewer could also support table manipulations and filtering of rows on the basis of annotation values. Another example is a syntactic tree-building and visualization component that visualizes syntax trees as tree graphs. For co-reference tiers, a component could be made that shows a simple list of annotated co-references. Clicking on an item would show all annotations that it refers to by superimposing a color highlight on its parent tier's visualization.


4.0 Conclusion
 

Having an expressive and generic model for linguistic annotations is very beneficial for the construction of software tools for the creation, manipulation, and analysis of annotated text corpora, as well as for the design and documentation of archive formats. It also makes file format conversions and exchange of documents a lot easier.

For such a model it seems to be sufficient to have a few basic elements and a few basic ways to combine them, and to describe explicitly what the constraints on these elements and combinations are.

A wide range of linguistic annotation constructs can then be represented, and these constructs can be explicitly characterized and recognized by software tools. These tools can then choose to offer specialized user interfaces for the entry and manipulation of annotation data, for visualizing it, and for formulating queries on it.


Appendix: Eudico Annotation Format (EAF) DTD
 

This appendix contains the Document Type Definition for the EAF format. EAF's elements closely resemble classes of the Abstract Corpus Model described in this paper. We think that EAF can be considered "portable" in the sense of (Bird & Simons, 2003), since it is open, it uses Unicode character encoding, and it makes its content as explicit as possible by using descriptive markup.

The use of Unicode and the complexity of the relations between EAF's elements mean that EAF is not directly human-readable, but the simplicity of the underlying model makes it at least easily human-analyzable.

There are some aspects of EAF that will be improved in the future. First, constraints are represented as descriptive text meant for human readers only. If feasible and useful, using a formal "annotation constraint language" would make machine interpretation possible as well. At the least, constraints on annotation content should be represented formally, for example by linking to some repository of Controlled Vocabularies. Second, EAF documents are currently self-contained in the sense that annotations refer only to annotations in the same document. It may be necessary at some point to support references between documents. Finally, it should be possible for an EAF annotation document to refer to multiple signal files.

Here follows the current EAF DTD:

<!--
        Eudico Annotation Format DTD
        version 2.0
        June 19, 2002
-->
<!ELEMENT ANNOTATION_DOCUMENT (HEADER, TIME_ORDER, TIER*, LINGUISTIC_TYPE*, LOCALE*, CONSTRAINT*)>
<!ATTLIST ANNOTATION_DOCUMENT
          DATE CDATA #REQUIRED
          AUTHOR CDATA #REQUIRED
          VERSION CDATA #REQUIRED
          FORMAT CDATA #FIXED "2.0"
>
<!ELEMENT HEADER EMPTY>
<!ATTLIST HEADER
          MEDIA_FILE CDATA #REQUIRED
          TIME_UNITS (NTSC-frames | PAL-frames | milliseconds) "milliseconds"
>
<!ELEMENT TIME_ORDER (TIME_SLOT*)>
<!ELEMENT TIME_SLOT EMPTY>
<!ATTLIST TIME_SLOT
          TIME_SLOT_ID ID #REQUIRED
          TIME_VALUE CDATA #IMPLIED
>
<!ELEMENT TIER (ANNOTATION*)>
<!ATTLIST TIER
          TIER_ID ID #REQUIRED
          PARTICIPANT CDATA #IMPLIED
          LINGUISTIC_TYPE_REF IDREF #REQUIRED
          DEFAULT_LOCALE IDREF #IMPLIED
          PARENT_REF IDREF #IMPLIED
>
<!ELEMENT ANNOTATION (ALIGNABLE_ANNOTATION | REF_ANNOTATION)>
<!ELEMENT ALIGNABLE_ANNOTATION (ANNOTATION_VALUE)>
<!ATTLIST ALIGNABLE_ANNOTATION
          ANNOTATION_ID ID #REQUIRED
          TIME_SLOT_REF1 IDREF #REQUIRED
          TIME_SLOT_REF2 IDREF #REQUIRED
>
<!ELEMENT REF_ANNOTATION (ANNOTATION_VALUE)>
<!ATTLIST REF_ANNOTATION
          ANNOTATION_ID ID #REQUIRED
          ANNOTATION_REF IDREF #REQUIRED
          PREVIOUS_ANNOTATION IDREF #IMPLIED
>
<!ELEMENT ANNOTATION_VALUE (#PCDATA)>
<!ELEMENT LINGUISTIC_TYPE EMPTY>
<!ATTLIST LINGUISTIC_TYPE
          LINGUISTIC_TYPE_ID ID #REQUIRED
          TIME_ALIGNABLE CDATA #IMPLIED
          CONSTRAINTS IDREFS #IMPLIED
>
<!ELEMENT LOCALE EMPTY>
<!ATTLIST LOCALE
          LANGUAGE_CODE ID #REQUIRED
          COUNTRY_CODE CDATA #IMPLIED
          VARIANT CDATA #IMPLIED
>
<!ELEMENT CONSTRAINT EMPTY>
<!ATTLIST CONSTRAINT
          STEREOTYPE ID #REQUIRED
          DESCRIPTION CDATA #IMPLIED
>
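For illustration, a minimal document instance conforming to this DTD might look as follows; the media file name, IDs, times and annotation values are invented for the example, which parses the instance with Python's standard library.

```python
import xml.etree.ElementTree as ET

# One time-aligned utterance, plus a reference annotation on a
# dependent "translation" tier, following the DTD's element order:
# HEADER, TIME_ORDER, TIER*, LINGUISTIC_TYPE*.
EAF = """<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT DATE="2003-01-01" AUTHOR="unknown" VERSION="1" FORMAT="2.0">
  <HEADER MEDIA_FILE="recording.mpg" TIME_UNITS="milliseconds"/>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="3400"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="utt-type">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>so from here.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation" LINGUISTIC_TYPE_REF="trans-type"
        PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>so, from here.</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="utt-type" TIME_ALIGNABLE="true"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="trans-type" TIME_ALIGNABLE="false"/>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(EAF)
slots = {s.get("TIME_SLOT_ID"): s.get("TIME_VALUE")
         for s in root.find("TIME_ORDER")}
print(slots["ts1"], slots["ts2"])  # 1200 3400
```

Note how the translation annotation carries no time slots of its own: it is anchored in time only through the ANNOTATION_REF link to the utterance.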


References and links
 

1http://www.mpi.nl/DOBES
2http://www.volkswagen-stiftung.de
3http://www.mpi.nl/IMDI
4http://www.mpi.nl/world/tg/CAVA/CAVA.html
5http://www.mpi.nl/world/tg/lapp/eudico/eudico.html
6http://childes.psy.cmu.edu
7http://www.sil.org/computing/shoebox
8http://www.mpi.nl/COREX
9http://www.mpi.nl/tools


Steven Bird and Mark Liberman (2001) A formal framework for linguistic annotation. Speech Communication 33(1,2), pp 23-60.

Steven Bird and Gary F. Simons (2003) Seven Dimensions of Portability for Language Documentation and Description. Language 79(3), pp 557-582.

Hennie Brugman and Peter Wittenburg (2001) The application of annotation models for the construction of databases and tools. IRCS Workshop on Linguistic Databases, University of Pennsylvania.

Sebastian Drude (2002) Advanced Glossing – a language documentation format and its implementation with Shoebox. International Workshop on Resources and Tools in Field Linguistics at LREC 2002.
