Paul Trilsbeek & Peter Wittenburg, Max Planck Institute for Psycholinguistics


Language Resource Archiving at the MPI — management and utilization

The MPI for Psycholinguistics has a steadily growing archive of language resources. It covers different types of resources and contributions from very different types of projects, such as endangered languages documentation, gesture studies, child language acquisition studies, the Dutch Spoken Corpus and many others. To be able to organize and manage this corpus, the MPI developed the IMDI infrastructure and recently launched the first version of LAMUS (Language Archive Management and Upload System), a content management system specially designed for language resources. LAMUS allows users as well as archive managers to upload all types of resources into the archive while maintaining consistency and coherence. It will also allow the MPI to open up its archive for contributions from individual researchers and projects not directly linked with the MPI. At the physical layer, the MPI ensures that several copies are created dynamically at different places in Europe.

Understanding that the language resources contained in an archive such as the one at the MPI are the “gold” for current and future scientific work, the MPI team is currently working on optimizing the access infrastructure. Access has to be given at different layers, and in principle it should be possible to use all archival objects with software that users have written or chosen themselves. At the metadata level, a large variety of options is available. Searching Google for “IMDI Teop”, for example, yields the top node of the browsable corpus, from which users can navigate further. It is also possible to copy whole sub-trees to generate new and complete archives. However, users should also be able to access all content with the help of web-based methods. Currently, lexica can be viewed and modified with the LEXUS tool, and annotated media can be viewed with the ANNEX tool. First interactions between these two domains have been realized. This work will continue in order to create a useful web-based analysis and enrichment infrastructure.

LEXUS and ANNEX already include the possibility to carry out searches across many resources that come from different projects and use different terminologies (tags, values). It is therefore necessary to create mappings between these terminologies. The approach taken by the MPI is to combine the advantages of bottom-up strategies with those of top-down strategies. Central ontologies in this sense are seen as an excellent reference framework for semantic interoperability, but they are too inflexible to meet the needs of the individual researcher. Concept and relation registries created from the resources can overcome these limitations. It is intended to present a rich web-based analysis framework at the end of 2005.