Danish Dependency Treebank

Annotation guide: Theory

Matthias T. Kromann
Department of Computational Linguistics
Copenhagen Business School

Syntax graphs

The main idea in dependency theory is that syntactic structure can be described as a set of typed relations between ordered pairs of words: a main word (called a head) and a subordinate word (called a subordinate). The syntactic analysis is encoded as a graph containing labeled nodes (representing words) and labeled directed edges (representing relations between words). Different levels of syntactic structure are encoded with different types of relations:
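To make this encoding concrete, here is a minimal sketch in Python of such a syntax graph (a hypothetical illustration, not the treebank's own software): labeled nodes for words and labeled directed edges for relations. Note that nothing in the structure forbids a word from having several heads, or the edges from forming cycles; this matters for the choice of graphical layout below.

    from dataclasses import dataclass, field

    @dataclass
    class Word:
        index: int   # position of the word in the sentence
        form: str    # the word itself
        tag: str     # word class, eg a PAROLE tag such as "VA"

    @dataclass
    class SyntaxGraph:
        words: list[Word] = field(default_factory=list)
        # one (head, dependent, relation) triple per typed relation
        edges: list[tuple[int, int, str]] = field(default_factory=list)

        def add_edge(self, head: int, dep: int, rel: str) -> None:
            self.edges.append((head, dep, rel))

        def governors_of(self, dep: int) -> list[tuple[int, str]]:
            """All (head, relation) pairs for a word; a word may have
            several governors, and cycles are not ruled out."""
            return [(h, r) for (h, d, r) in self.edges if d == dep]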

The three graphs below show three possible visual representations of the same syntax graph: the arc layout used in the Danish Dependency Treebank, the layout used in classical dependency theory, and the phrase-structure layout used in discontinuous phrase-structure theories.

Dependency graph (arc layout): Peter will paint the wall today
Dependency graph (classical dependency layout): Peter will paint the wall today
Phrase-structure graph (classical layout): Peter will paint the wall today

The treebank uses the arc layout on the left, where relations are drawn as curved arrows from heads to subordinate words, with a label below each arrowhead that indicates the type of the relation (ie, "subj"=subject, "dobj"=direct object, etc). Arcs that encode primary dependencies (ie, complements and adjuncts without fillers, and gapping dependents) are shown above the words; arcs that encode fillers, landing sites, and coreference are displayed below the words. The word class is indicated with a PAROLE tag below each word (eg, NC, VA, etc).

Compared to the classical dependency layout and the phrase-structure layout, the arc layout has several advantages, listed below:

Words with multiple incoming edges
    Phrase-structure and classical layouts: impossible.
    Arc layout: no problem.

Cyclic relations
    Phrase-structure and classical layouts: impossible.
    Arc layout: no problem.

Discontinuities
    Phrase-structure and classical layouts: awkward. It is hard to design good layout algorithms for discontinuous trees that ensure that discontinuous edges do not cross nodes in the tree, and that edge labels do not collide with node labels; in classical dependency graphs, this problem is solved by reordering the words, but then the original word order is lost.
    Arc layout: no problem. Node labels and edges are drawn in separate areas of the drawing, so arcs may cross each other, but they never collide with node labels.

Multi-line and multi-page layout for long sentences
    Phrase-structure and classical layouts: awkward. Tree depth is often proportional to sentence length, so very long sentences tend to result in deep trees that are difficult to split across several lines and pages (ie, classical dependency graphs and phrase-structure graphs tend to be two-dimensional).
    Arc layout: no problem. Arc height rarely exceeds 10, even for very long sentences and texts, so long sentences result in arc graphs that are relatively flat and hence easy to split across several lines and pages (ie, arc graphs tend to be one-dimensional).

Empty categories
    Phrase-structure and classical layouts: no problem.
    Arc layout: no problem.

Phrasal nodes
    Phrase-structure and classical layouts: no problem.
    Arc layout: awkward (but phrasal nodes are not used in a dependency framework).

Since our linguistic theory uses multiple heads, cyclicity, discontinuities, and empty categories, but not phrasal nodes, the arc layout is the best choice of graphical representation in our framework -- especially because sentences in our corpus can be quite long (up to 70-90 words), and because we intend to eventually encode entire discourse structures in one graph, ie, we eventually need to create graphs that span entire texts. One relative advantage of the classical dependency layout and the phrase-structure layout is that most linguists are familiar with them. But the arc layout has been used in a few syntax theories, including Word Grammar.
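The claim about arc height can be made concrete: in an arc layout, the height an arc needs depends only on how many arcs are properly nested inside it, not on the length of the sentence. The toy sketch below uses an assumed definition of arc height for illustration only (it ignores crossing arcs, which a real layout algorithm must also handle):

    def arc_heights(arcs):
        """arcs: (left, right) word-index pairs with left < right.
        An arc must be drawn higher than every arc nested inside it."""
        heights = {}
        for arc in sorted(arcs, key=lambda a: a[1] - a[0]):  # shortest first
            l, r = arc
            inner = [h for a, h in heights.items() if l <= a[0] and a[1] <= r]
            heights[arc] = 1 + max(inner, default=0)
        return heights

    print(arc_heights([(0, 1), (1, 2), (3, 4), (2, 4), (1, 5)]))
    # {(0, 1): 1, (1, 2): 1, (3, 4): 1, (2, 4): 2, (1, 5): 3}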

Valency: complements and adjuncts

Deep syntactic relations are called dependencies. The head of the dependency (the governor) determines the syntactic and semantic type of the combined phrase, whereas the subordinate word (the dependent) may have a completely different syntactic and semantic type. The dependency must be licensed in the lexical entry of either the governor or the dependent. If the dependency is lexically licensed by the governor, then the dependent is called a complement; if the dependency is lexically licensed by the dependent, then the dependent is called an adjunct. Semantically, adjuncts act as functors to their governors, whereas complements act as arguments to their governors. This gives the following schematic translation from dependency trees (or deep trees) to functor-argument trees (note that the translation is one-to-many, since the modifiers could have been applied in the opposite order):

Dependency tree with corresponding functor-argument trees
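A rough sketch of this translation is given below, building terms as plain strings for illustration: complements become arguments of the governor's functor, and each adjunct becomes a functor wrapping the phrase built so far. When there are several adjuncts, the sketch simply applies them in list order, although other orders are equally valid (which is why the translation is one-to-many):

    def to_term(word, complements, adjuncts):
        """complements and adjuncts are already-translated terms."""
        # complements become arguments of the governor's functor ...
        term = f"{word}'({', '.join(complements)})" if complements else f"{word}'"
        # ... whereas each adjunct becomes a functor around the phrase
        for adj in adjuncts:
            term = f"{adj}({term})"
        return term

    print(to_term("paint", ["peter'", "the'(wall')"], ["today'"]))
    # today'(paint'(peter', the'(wall')))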

There is no clear-cut distinction between complements and adjuncts (cf. Helbig & Schenkel 1971; Helbig 1992). Prototypical complements include subjects, direct objects and indirect objects (and other case-marked dependents), and prototypical adjuncts include time and place adverbials. Optionality is sometimes mentioned as a criterion to distinguish between complements and adjuncts, because adjuncts tend to be optional (ie, they can be deleted without leaving the phrase ungrammatical), whereas complements tend to be obligatory (ie, they cannot be deleted). But there are borderline cases where this rule of thumb does not apply. For example, within an appropriate context, most complements are optional (eg, objects can be omitted if they can be inferred from the context, and in telegraphic sentences, even the subject can be omitted).

Our view is that complements and adjuncts represent different mechanisms for encoding a linguistic construction in the grammar, with different consequences for the size of the grammar. The grammar-writer (and the human brain during language acquisition) has the objective of making the grammar as economical as possible by choosing the mechanism that minimizes: (1) the total size of the grammar (measured by the number of times a rule is stated and the size of each rule); and (2) the total number of complement or adjunct rules associated with each word. Having a large grammar carries a large memory cost, and having too many rules associated with one word carries a large processing cost. In most cases, one of the two mechanisms is much more economical than the other, but in borderline cases the difference in economy is so small that the adjunct and complement mechanisms are equally acceptable. The main prototypical characteristics of complements and adjuncts are summarized below:

Optionality
    Adjunct mechanism: adjuncts tend to be optional.
    Complement mechanism: complements tend to be obligatory.

Uniqueness
    Adjunct mechanism: a governor can have several adjuncts of the same type.
    Complement mechanism: a governor can have only one complement of each type (with a proper definition of "type": a verb may have more than one prepositional object ("pobj"), as long as the prepositions are different).

Selection
    Adjunct mechanism: the possible governors are characterized by having a large, easily definable syntactic and/or semantic class.
    Complement mechanism: the possible complements are characterized by having a large, easily definable syntactic and/or semantic class.

Semantics
    Adjunct mechanism: the adjunct has a natural rule for computing the interpretation of the combined phrase, given the semantic representation of the governor as its argument.
    Complement mechanism: the governor has a natural rule for computing the interpretation of the combined phrase, given the semantic representations of the complements as its arguments.

Example 1. To illustrate these ideas, we will analyze the sentence "Peter will paint the wall today" within a dependency framework. This means that we have to hypothesize complement and adjunct rules for all the words involved, and test whether these rules generate all grammatical sentences in English (and only those sentences). The first step is to determine the grammatical category of each word: ie, "Peter" and "wall" are nouns, "the" is a pronoun (or determiner, if you like), "will" and "paint" are verbs, and "today" is an adverb. The second step is to look for minimal construction schemata, like: "X paints Y", "X will Y", "the X", etc. These often correspond to complement schemata, so it would be natural to hypothesize the following complement structures:

From these lexical rules, we can deduce that the only possibility is that "will" must be the governor of the whole sentence, with "paint" as its verbal complement and "Peter" as its subject complement (note that "Peter" agrees with "will", but not with "paint": "Peter paints/*paint the wall"); that "Peter" must be the subject of "paint" as well (so "Peter" is the complement of two verbs, but without subject agreement with the second verb "paint"), and that "the wall" is the direct object of "paint"; and, finally, that "wall" is the noun complement of "the". The only remaining word is "today". It does not seem to be a complement of "paint" or "will", since it can occur with any verb (and indeed most nouns as well); nor can it be the head of the whole sentence, because it does not have the same syntactic or semantic type as the whole sentence (for example, you can say: "I wonder whether X" where X can be "Peter will paint the wall today", "Peter will", "Peter paints the wall", but you cannot say "*I wonder whether today"). Thus, the only remaining possibility is to analyze "today" as an adjunct of "paint" (or perhaps "will"). This gives the following dependency graph:

Peter will paint the wall today
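The hypothesized complement rules can be written out as valency frames and checked against the analysis. The rule format below is hypothetical (the treebank's lexicon is not specified here), but it makes the deduction in Example 1 explicit:

    # Hypothesized valency frames: which complement roles each word
    # licenses, and which word classes may fill them.
    VALENCY = {
        "will":  {"subj": {"noun"}, "vobj": {"verb"}},            # "X will Y"
        "paint": {"subj": {"noun"}, "dobj": {"noun", "pronoun"}}, # "X paints Y"
        "the":   {"nobj": {"noun"}},                              # "the X"
        # "today" licenses no complements; as an adjunct it licenses
        # its verbal governor in its own lexical entry instead.
    }
    CLASS = {"Peter": "noun", "will": "verb", "paint": "verb",
             "the": "pronoun", "wall": "noun", "today": "adverb"}

    EDGES = [("will", "subj", "Peter"), ("will", "vobj", "paint"),
             ("paint", "subj", "Peter"),   # the subject filler dependency
             ("paint", "dobj", "the"),     # "the" heads "the wall" here
             ("the", "nobj", "wall")]

    # every hypothesized complement edge is licensed by a valency frame
    for head, role, dep in EDGES:
        assert CLASS[dep] in VALENCY[head][role], (head, role, dep)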

Note that in most cases, other analyses are possible as well. For example, many linguists analyze "the man" as a phrase headed by "man" rather than "the" (we will argue against this analysis later). The choice between the different analyses is a question of hypothesizing an underlying grammar for each of the analyses, and choosing the analysis whose corresponding grammar leads to the most precise predictions of what is correct English -- ie, the analysis that seems most capable of generating all sentences deemed grammatical by native speakers, and excluding all sentences deemed ungrammatical by native speakers.

Fillers

There are many constructions where a word has more than one governor, ie, both a primary governor and one or more secondary governors for which it provides a filler (semantic variable). For example, in the sentence "He has seen it", "he" is the subject of the main verb "has", but in some way also acts as subject for the subordinate verb "seen"; and in the sentence "This is the man we know", "the man" is the predicative of "is", but in some way also acts as direct object of "know". In the treebank, we take a slightly simplified view where a word can have secondary dependencies (called filler dependencies) that provide a "copy" (ie, semantic variable) of the word to the secondary governors. Filler dependencies are shown below the words, with a dependency label of the form "[role]". The square brackets indicate that the dependency is a filler dependency, and role states the dependency role the filler has with respect to the secondary governor (eg, "[subj]" denotes a subject filler dependency). Some examples of treebank analyses with filler dependencies are shown below:

han har set det ("he has seen it")      han vil have set det ("he will have seen it")      ham vi kender ("him we know")      hvem hun kender ("who she knows")      hvordan de gør ("how they do")
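For the first of these examples, "han har set det" ("he has seen it"), the primary and filler dependencies can be listed as labeled edges. This is a hypothetical plain-data rendering of the analysis, with word positions as indices and the bracketed label marking the filler dependency:

    words = ["han", "har", "set", "det"]   # "he has seen it"
    edges = [
        (1, 0, "subj"),    # har -> han: primary subject (drawn above)
        (1, 2, "vobj"),    # har -> set: verbal complement
        (2, 3, "dobj"),    # set -> det: direct object
        (2, 0, "[subj]"),  # set -> han: subject filler (drawn below)
    ]
    # words with more than one governor provide fillers to the extras:
    counts = {d: sum(1 for (_, dd, _) in edges if dd == d) for (_, d, _) in edges}
    print([words[d] for d, n in counts.items() if n > 1])   # ['han']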

Fillers are used in a wide range of constructions, including:

The treebank analysis is a simplification of the analysis of fillers in Discontinuous Grammar. The treebank analysis is more intuitive than its counterpart in Discontinuous Grammar because it corresponds more directly to a semantic level, and less complicated because it replaces a phonetically empty node and two dependencies with a single filler dependency. However, from a technical point of view, the DG analysis is more precise because it accounts for the licensing conditions for fillers and their interaction with island constraints. The DG analysis does this by distinguishing between a phonetically empty filler, a filler licensor that licenses the filler, and a filler source in the local neighbourhood of the filler licensor (ie, a neighbour of the filler licensor in the dependency graph) that is used as a forced antecedent for the filler. The first three graphs below show the technically precise DG analyses corresponding to the first three treebank graphs shown above; the "ref" dependency goes from the filler source to the filler, and the "fill" dependency goes from the filler licensor to the filler. The fourth graph shows a relative clause where the filler's governor and licensor do not coincide.

han har set det ("he has seen it")      han vil have set det ("he will have seen it")      ham vi kender ("him we know")      ham vi har kendt ("him we have known")

Anaphora

In the treebank, an anaphoric reference between a pronoun and its antecedent can be indicated by a dependency that goes from the antecedent to the anaphoric element, labeled with "ref"; the dependency is usually drawn below the words. Anaphoric reference is currently only specified for purely syntactically determined anaphoric references, such as relative pronouns or wh-pronouns that refer to a relativized dependent within a relative clause. A few examples are shown below:

ham der synger ("him who sings")      ham på hvem vi ser ("him at whom we look")      Jeg har lavet kage, som du bad mig om. ("I have made cake, as you asked me to.")

We have chosen not to specify anaphoric references in any other constructions. There are four reasons for this: (1) it would be too time-consuming to specify antecedents for all pronouns and definite noun phrases; (2) there are usually many possible antecedents for each pronoun, which leads to structural ambiguity, and human annotators often have difficulties agreeing on what is the "right" antecedent; (3) it is not clear that general anaphoric reference is a syntactic phenomenon -- for example, pronouns may refer to objects in the physical context that have not been introduced into the linguistic discourse, so perhaps anaphoric resolution is part of a general semantic reasoning process that involves both linguistic and non-linguistic information; (4) the specification of anaphoric references requires a theory of semantics, pragmatics and anaphoric reference, and we currently don't have such a theory.

The treebank analysis of relative clauses that contain a relative pronoun is a simplification of the analysis in Discontinuous Grammar. The main difference is that in DG, there is no direct "ref" dependency between the relative pronoun and the relativized noun. Instead, there is a filler licensed by the relative verb that acts as the complement of the relative pronoun, in analogy with the analysis of relative clauses without a relative pronoun. This is shown in the examples below:

ham der synger ("him who sings")      ham på hvem vi ser ("him at whom we look")      barnet hvis mad vi anretter ("the child whose food we serve")      Jeg har lavet kage, som du bad mig om. ("I have made cake, as you asked me to.")      drengen i munden på hvem vi kiggede ("the boy into whose mouth we looked")

Gapping dependents

In gapping coordinations (such as "John loves coffee, and Mary tea"), the second conjunct has a phonetically empty head that may take the same complements and adjuncts as the first conjunct. In the treebank, where empty heads are not allowed, dependents of a phonetically empty head must be treated as special elliptic dependents of the governor of the elided head, indicated with an edge label of the form "<role>" (eg, "<subj>"), in analogy with the "[role]" labels for fillers. Some examples are shown below:

Kaffe skænker han til os, og hun til dem ("Coffee he pours for us, and she for them")      Vi serverer Anne te, Bo kaffe, Clara kakao, og Dorthe mælk ("We serve Anne tea, Bo coffee, Clara cocoa, and Dorthe milk")

The treebank analysis is a simplification of the analysis in Discontinuous Grammar, where the gap is represented by a filler node with a special "gap" dependency to the head of the ungapped conjunct, as shown below.

Kaffe skænker han til os, og hun til dem      Vi serverer Anne te, Bo kaffe, Clara kakao, og Dorthe mælk


Syntactic type shifting

There are many instances of words that clearly belong to one word class, but can be type-shifted to another word class in certain circumstances. For example, given the right context, any adjective in Danish can be used as a noun (eg, as a subject, direct or indirect object, or nominal object), and some adjectives -- including words like "flere" ("more"), "ældre" ("older" = "the old people"), "unge" ("young" = "the young people"), and "gående" ("walking" = "walking people") -- have acquired an idiomatic use as nouns. In these cases, the PAROLE tag is preserved, but the word is allowed to take on a dependent role that is normally reserved for other word classes (eg, "subj" can be applied to adjectives that function as nouns).

Adjuncts license their adjunct governors in the lexicon, and act as functors to their adjunct governors during semantic interpretation -- ie, the lexical entry for the adjunct usually includes a description of the syntactic and semantic category of the governor. Thus, we could plausibly posit a type-shifting mechanism where an adjunct is allowed to pose externally as if it were the governor, ie, to take over the governor's word class and other syntactic properties externally (when acting as a complement to other words), while keeping its own word class internally (so that it still looks like the original adjunct to its complements). The missing governor argument could be provided by a default (eg, "people"), or retrieved anaphorically from the context. Technically, this could be done by requiring a type-shifted adjunct to always generate an anaphoric filler, which is bound as filler object to the adjunct, and which requires a semantic antecedent in the (linguistic or non-linguistic) discourse during interpretation. This analysis is exemplified below, where the anaphoric filler generated by "grønne" has selected "æbler" as its antecedent:

Røde æbler er bedre end grønne. ("Red apples are better than green ones.")

Word classes

The PAROLE tagset contains three morphologically defined word classes:

The gerund (gerundium) can be formed from almost all verbs by adding "+en" (eg, "skaben", "undren", "løben", "syngen", "laven mad"). The full PAROLE word class tagset for Danish is shown below:

Here is a graphical diagram of our organization of the PAROLE word classes. The word class CS is not recognized as a separate word class; its member words are mostly analyzed as prepositions. Moreover, cardinal adjectives are analyzed as cardinal numerals, a special group of pronouns.

Our PAROLE word class hierarchy

Edge classes

The treebank makes use of the following complement edges, shown along with the word classes that typically act as governors and complements in such constructions ("?" indicates that any word class may be used):

It also makes use of the following adjunct edges:

In addition, there are edges used for encoding landing sites, coreference, fillers, and gaps:

The following edges are not used in the tagging, but are reserved for future annotations:

The edges used in the treebank are ordered in a type hierarchy:

Edge type hierarchy

Complements are subdivided into the following types:

Adjuncts are subdivided into the following types:

Principles for selecting the best analysis

In many cases, more than one analysis is possible. In that case, the following principles will be used to select the "best" analysis:

Here are some tests that should be considered when weighing two analyses against each other:

Software for manipulating and searching the treebank

The Danish Dependency Treebank has been created with the DTAG treebank tool. DTAG also allows the user to search a treebank for syntactic constructions, specified with a constraint-based query language. We are currently working on extending DTAG with modules for learning a massively probabilistic dependency grammar from a treebank, and for parsing texts with a massively probabilistic dependency grammar.

File format for saving analyses

There are obviously many ways to store the dependency analyses in the treebank. So far, the annotation program implements a very primitive XML-like data format. An example is shown below:

<W msd="PP" gloss="He" in="1:subj|2:[subj]" out="">Han</W>
<W msd="VA" gloss="has" in="" out="-1:subj|1:vobj">har</W>
<W msd="VA" gloss="seen" in="-1:vobj" out="-2:[subj]|1:dobj">set</W>
<W msd="PD" gloss="it" in="-1:dobj" out="">det</W>

The word itself is enclosed within the <W> tags, and the attributes of the <W> tag are used to encode arbitrary variables associated with the word, as well as its incoming edges from, and outgoing edges to, other words. The reserved attribute names are shown below:

The annotation format is an extension of the notation used in the Danish PAROLE corpus. The tagging software treats all non-<W> tags as comments, so these tags are left unchanged by the software.
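As an illustration, here is a minimal sketch of a reader for this format. It assumes, consistently with the example above (though the real DTAG conventions may differ), that "in" and "out" list edges as offset:label pairs separated by "|", with offsets relative to the current word, and that "in" simply mirrors the "out" attribute of the other endpoint, so reading "out" alone recovers the whole graph:

    import re

    W_RE = re.compile(r'<W([^>]*)>([^<]*)</W>')
    ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

    def read_dtag(text):
        """Return (words, edges), where edges are (head, dependent, label)
        triples with absolute word positions."""
        words, edges = [], []
        for i, m in enumerate(W_RE.finditer(text)):
            attrs = dict(ATTR_RE.findall(m.group(1)))
            words.append((m.group(2), attrs.get("msd", "")))
            for spec in filter(None, attrs.get("out", "").split("|")):
                offset, label = spec.split(":", 1)
                edges.append((i, i + int(offset), label))
        return words, edges

    example = '''<W msd="PP" gloss="He" in="1:subj|2:[subj]" out="">Han</W>
    <W msd="VA" gloss="has" in="" out="-1:subj|1:vobj">har</W>
    <W msd="VA" gloss="seen" in="-1:vobj" out="-2:[subj]|1:dobj">set</W>
    <W msd="PD" gloss="it" in="-1:dobj" out="">det</W>'''

    print(read_dtag(example))
    # ([('Han', 'PP'), ('har', 'VA'), ('set', 'VA'), ('det', 'PD')],
    #  [(1, 0, 'subj'), (1, 2, 'vobj'), (2, 0, '[subj]'), (2, 3, 'dobj')])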

In the future, we would like to add support for TIGER XML. In the distant future, we would also like to add support for the ATLAS interchange format.


http://www.id.cbs.dk/~mtk/dtag/ddt/theory.html last updated by Matthias T. Kromann at 2004-08-03 15:49