Michael Cysouw, Jeff Good, Mihai Albu & Hans-Jörg Bibiko, Max Planck Institute for Evolutionary Anthropology


Can GOLD "cope" with WALS? Retrofitting an ontology onto the World Atlas of Language Structures

Background:
The World Atlas of Language Structures (WALS) is a large-scale "database of databases" consisting of 141 typological databases, covering a wide range of grammatical features, joined into one composite resource through the use of a common metadata scheme. At present, a project is underway to "retrofit" an ontology onto this existing resource.

Research issues:
Three distinct research issues have been raised during the construction of the WALS ontology. Each is taken in turn. The following list of categories from WALS, for the grammatical feature of "Voicing and gaps in plosive systems", is useful for illustrating the first two.
  1. Missing /p/
  2. Missing /g/
  3. Missing both
  4. None missing in /p t k b d g/
  5. Other
(i) Non-canonical features: The feature set above makes use of what we term non-canonical features: in this case, features defined in terms of the absence of some segment (here, /p/ or /g/). This differs from a canonical feature type, where some set of data is described in terms of the presence of a well-defined feature. A number of non-canonical category types have been encountered in WALS, and these have presented challenges to the development of the WALS ontology.

(ii) Implicit logical dependencies: The feature set above also exemplifies the problem of logical dependencies among features which are not explicitly encoded in WALS. In this case, the categories "Missing /g/" and "Missing both" are treated as completely distinct when, logically, the two are related, since a language missing both /p/ and /g/ is also a language missing /g/.
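The dependency described in (ii) can be illustrated with a minimal Python sketch. It is not the WALS ontology itself: the MISSING table, the function names, and the placeholder language labels are all assumptions introduced for illustration. Each category is described by the set of plosives it asserts to be absent, so a query for "languages missing /g/" picks up both "Missing /g/" and "Missing both". The category "Other" is deliberately left out of the table, because its content is not defined in the list above.

```python
# Hypothetical encoding (an assumption, not the WALS data model): each
# category is described by the set of plosives it asserts to be missing
# from /p t k b d g/. "Other" is omitted because its content is undefined.
MISSING = {
    "Missing /p/": frozenset({"p"}),
    "Missing /g/": frozenset({"g"}),
    "Missing both": frozenset({"p", "g"}),
    "None missing in /p t k b d g/": frozenset(),
}


def entails_missing(category, phoneme):
    """Return True if membership in `category` implies `phoneme` is missing."""
    return phoneme in MISSING.get(category, frozenset())


def languages_missing(phoneme, assignments):
    """Return the sorted languages whose category implies `phoneme` is missing.

    `assignments` maps a language name to its WALS category label.
    """
    return sorted(
        lang for lang, cat in assignments.items()
        if entails_missing(cat, phoneme)
    )


# Placeholder languages, not real WALS codings.
sample = {
    "Language A": "Missing /g/",
    "Language B": "Missing both",
    "Language C": "Missing /p/",
    "Language D": "Other",
}
result = languages_missing("g", sample)  # Language A and Language B
```

With the flat category labels alone, a search for "Missing /g/" would return only Language A; once the dependency is made explicit, Language B is found as well.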

(iii) Interdatabase category relationships: The third research issue raised during the development of the WALS ontology is how to encode the relationships among similar categories in different WALS databases. For example, five of the WALS databases make use of a category along the lines of "no case". In theory, the categorization of a language as having no case will be the same across all the databases. In practice, however, this is not always so. If such discrepancies are due to genuine disagreements about how to analyze a given language, then the problem cannot be "fixed" with an ontology. In some cases, however, the disagreement may result from subtle distinctions in how a term is understood in different contexts, which is a problem that can be at least partially dealt with using an ontology.

Discussion:
The WALS ontology project has developed various techniques for dealing with these three issues. Non-canonical features are handled by using a controlled vocabulary for encoding attested non-canonical relationships. Implicit logical dependencies have been straightforwardly dealt with by redefining all terms in the database in terms of an ontology in which the logical dependencies are explicitly encoded. Finally, interdatabase category relationships have been handled both by linking the database's terms to a sufficiently rich ontology and by developing a controlled vocabulary for expressing "lateral" relationships among possibly deceptively similar terms appearing in multiple databases.
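As a rough illustration of how a controlled vocabulary of "lateral" relationships might be used, the Python sketch below links similarly named terms in two databases and lists the languages on which they disagree. Everything in it is an assumption made for the example: the relation names (EQUIVALENT, NARROWER, OVERLAPPING), the database identifiers db1 and db2, the category label "2 cases", and the placeholder languages. None of these are taken from the project's actual vocabulary or from WALS data.

```python
from enum import Enum
from typing import NamedTuple


class Lateral(Enum):
    """Hypothetical controlled vocabulary of lateral relationships."""
    EQUIVALENT = "equivalent"
    NARROWER = "narrower"
    OVERLAPPING = "overlapping"


class Term(NamedTuple):
    database: str  # identifier of the database the term comes from
    label: str     # category label as used in that database


class Link(NamedTuple):
    source: Term
    relation: Lateral
    target: Term


def disagreements(links, codings):
    """List languages coded inconsistently under terms declared equivalent.

    `codings` maps a database identifier to {language: category label}.
    Only languages present in both linked databases are compared.
    """
    found = []
    for link in links:
        if link.relation is not Lateral.EQUIVALENT:
            continue
        src = codings[link.source.database]
        tgt = codings[link.target.database]
        for lang in sorted(src.keys() & tgt.keys()):
            in_src = src[lang] == link.source.label
            in_tgt = tgt[lang] == link.target.label
            if in_src != in_tgt:
                found.append(lang)
    return found


# Placeholder codings, not real WALS data.
codings = {
    "db1": {"Language X": "no case", "Language Y": "no case"},
    "db2": {"Language X": "no case", "Language Y": "2 cases",
            "Language Z": "no case"},
}
links = [Link(Term("db1", "no case"), Lateral.EQUIVALENT,
              Term("db2", "no case"))]
conflicts = disagreements(links, codings)  # only Language Y
```

A conflict reported this way does not say whether the cause is a genuine analytical disagreement or a difference in how "no case" is understood; it only identifies the cases that would need inspection, after which the link could be weakened from EQUIVALENT to a looser relation.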