Skip to content
Show report in:

UMINF 12.03

Simulation relations as a means for pattern-matching in treebanks

We consider the problem of nding the occurrences of a pattern tree t in a treebank represented as a directed acyclic graph g, and propose a two-tiered technique, consisting of a preprocessing routine and a search algorithm. We assume that whereas the treebank itself is large and more or less static, the pattern tree is small and frequently updated. To model varying abstraction levels in the data, we work with partially ordered alphabets and compute simulation relations rather than equivalence relations. For instance, vertices and edges are labelled with elements from a pair of preorders S and P instead of unstructured alphabets. Under the above assumptions, we obtain a search algorithm that runs in time O(|S| + height(t)|t|(V/R)^2|P|), where (V/R) is the number of equivalence classes in the coarsest simulation relation Rg on the vertex set V of g. The size of the treebank thus only a ects the running time of the search algorithm indirectly, and this is due to the groundwork done by the preprocessing routine in time O(|S| + height (g)|g|j(V/R)^2|P|).


pattern matching, simulation, treebanks, complexity


Johanna Björklund and Lars-Daniel Öhman

Back Edit this report
Entry responsible: Johanna Bjorklund

Page Responsible: Frank Drewes