Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
Date
2006-10-30T20:23:50Z
Publisher
Johns Hopkins University
Abstract
This thesis is about estimating probabilistic models to uncover useful hidden structure
in data; specifically, we address the problem of discovering syntactic structure in natural
language text. We present three new parameter estimation techniques that generalize
the standard approach, maximum likelihood estimation, in different ways. Contrastive
estimation maximizes the conditional probability of the observed data given a “neighborhood”
of implicit negative examples. Skewed deterministic annealing locally maximizes
likelihood using a cautious parameter search strategy that starts with an easier optimization
problem than likelihood, and iteratively moves to harder problems, culminating in
likelihood. Structural annealing is similar, but starts with a heavy bias toward simple
syntactic structures and gradually relaxes the bias.
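As a concrete (toy) illustration of the first of these objectives, the sketch below scores an observed tag sequence against a "neighborhood" of adjacent-swap perturbations, which serve as the implicit negative examples. The bigram scoring function, the weights, and the data here are illustrative assumptions, not the models or features used in the thesis.

```python
import math

def score(seq, weights):
    """Unnormalized log-score of a tag sequence under a toy bigram model
    (an assumed stand-in for the thesis's actual models)."""
    return sum(weights.get((a, b), 0.0) for a, b in zip(seq, seq[1:]))

def neighborhood(seq):
    """All single adjacent transpositions of the sequence (in the spirit
    of a swap neighborhood): the implicit negative examples."""
    out = []
    for i in range(len(seq) - 1):
        s = list(seq)
        s[i], s[i + 1] = s[i + 1], s[i]
        out.append(tuple(s))
    return out

def contrastive_log_likelihood(data, weights):
    """Sum over observed sequences of log p(x | N(x)): the score of x
    minus the log-sum-exp over x and its neighborhood."""
    total = 0.0
    for x in data:
        x = tuple(x)
        candidates = [x] + neighborhood(x)
        log_z = math.log(sum(math.exp(score(c, weights)) for c in candidates))
        total += score(x, weights) - log_z
    return total

# Toy data: weights that prefer the observed orderings should beat
# uninformative (all-zero) weights under this objective.
data = [("D", "N", "V"), ("D", "N", "V", "N")]
w_good = {("D", "N"): 2.0, ("N", "V"): 2.0, ("V", "N"): 1.0}
w_zero = {}
assert contrastive_log_likelihood(data, w_good) > contrastive_log_likelihood(data, w_zero)
```

The point of the sketch is only the shape of the objective: probability mass is shifted onto the observed sequence at the expense of its perturbed neighbors, rather than at the expense of all possible sequences as in maximum likelihood.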
Our estimation methods do not make use of annotated examples. We consider their
performance under both unsupervised model selection, where models trained under
different initialization and regularization settings are compared by evaluating the training
objective on a small set of unseen, unannotated development data, and supervised model
selection, where the most accurate model on the development set (now with annotations)
is selected. The latter is far superior, but surprisingly few annotated examples are required.
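The two selection regimes can be sketched as follows; the candidate models and their development-set scores below are hypothetical stand-ins, used only to show that the two criteria need not pick the same model.

```python
# Sketch of the two model-selection regimes described above.

def select_unsupervised(models, dev_objective):
    """Pick the model whose training objective (e.g., log-likelihood) is
    best on unseen, unannotated development data."""
    return max(models, key=dev_objective)

def select_supervised(models, dev_accuracy):
    """Pick the model that is most accurate against a small annotated
    development set."""
    return max(models, key=dev_accuracy)

# Hypothetical candidates trained under different initialization and
# regularization settings, with made-up development scores:
candidates = ["init-A", "init-B", "init-C"]
dev_loglik = {"init-A": -1200.0, "init-B": -1150.0, "init-C": -1300.0}.get
dev_acc = {"init-A": 0.58, "init-B": 0.52, "init-C": 0.61}.get

print(select_unsupervised(candidates, dev_loglik))  # init-B
print(select_supervised(candidates, dev_acc))       # init-C
```

The divergence between the two selections in this toy setup mirrors the abstract's point: the training objective on unannotated data is an imperfect proxy for accuracy, which is why even a few annotated development examples help so much.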
The experimentation focuses in depth on a single dependency grammar induction
task. The aim is to give strong support for the usefulness of the new techniques in one
scenario. It must be noted, however, that the task (as defined here and in prior work) is
somewhat artificial, and improved performance on this particular task is not a direct contribution
to the greater field of natural language processing. The real problem the task seeks
to simulate—the induction of syntactic structure in natural language text—is certainly of
interest to the community, but this thesis does not directly approach the problem of exploiting
induced syntax in applications. We also do not attempt any realistic simulation of
human language learning, as our newspaper text data do not resemble the data encountered
by a child during language acquisition. Further, our iterative learning algorithms
assume a fixed batch of data that can be repeatedly accessed, not a long stream of data
observed over time in tandem with acquisition. (Of course, the cognitive criticisms apply
to virtually all existing learning methods in natural language processing, not just the new
ones presented here.) Nonetheless, the novel estimation methods presented are, we will
argue, better suited to adaptation for real engineering tasks than the maximum likelihood
baseline.
Our new methods are shown to achieve significant improvements over maximum
likelihood estimation and maximum a posteriori estimation, using the EM algorithm, for
a state-of-the-art probabilistic model used in dependency grammar induction (Klein and
Manning, 2004). The task is to induce dependency trees from part-of-speech tag sequences;
we follow standard practice and train and test on sequences of ten tags or fewer. Our
results are the best published to date for six languages, with supervised model selection:
English (improvement from 41.6% directed attachment accuracy to 66.7%, a 43% relative
error rate reduction), German (54.4% → 71.8%, a 38% error reduction), Bulgarian (45.6% →
58.3%, a 23% error reduction), Mandarin (50.0% → 58.0%, a 16% error reduction), Turkish
(48.0% → 62.4%, a 28% error reduction, though only a 2% error reduction over a left-branching
baseline, which gives 61.8%), and Portuguese (42.5% → 71.8%, a 51% error reduction). We
also demonstrate the success of contrastive estimation at learning to disambiguate
part-of-speech tags (from unannotated English text): 78.0% to 88.7% tagging accuracy on a
known-dictionary task (a 49% relative error rate reduction), and 66.5% to 78.4% on a more
difficult task with less dictionary knowledge (a 35% error rate reduction).
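The parsing figures above are directed attachment accuracies, reported alongside relative error reductions. A minimal sketch of both quantities, assuming trees encoded as per-word head indices (an encoding chosen here purely for illustration):

```python
# Directed attachment accuracy: the fraction of words whose predicted head
# matches the gold head. Sentences are encoded as lists of head indices,
# with 0 marking the root word (an assumed encoding for illustration).

def directed_attachment_accuracy(gold, pred):
    """gold, pred: parallel lists of per-sentence head-index lists."""
    correct = sum(g == p
                  for gs, ps in zip(gold, pred)
                  for g, p in zip(gs, ps))
    total = sum(len(gs) for gs in gold)
    return correct / total

def relative_error_reduction(baseline_acc, new_acc):
    """Share of the baseline's error eliminated by the new model."""
    return (new_acc - baseline_acc) / (1.0 - baseline_acc)

# Two toy sentences; the second predicted tree gets one of three heads wrong.
gold = [[2, 0, 2], [0, 1, 1]]
pred = [[2, 0, 2], [0, 1, 2]]
print(directed_attachment_accuracy(gold, pred))          # 5/6 correct

# The English result reported above: 41.6% -> 66.7% accuracy corresponds
# to a 43% relative error reduction.
print(round(relative_error_reduction(0.416, 0.667), 2))  # 0.43
```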
The experiments presented in this thesis give one of the most thorough explorations
to date of unsupervised parameter estimation for models of discrete structures. Two sides
of the problem are considered in depth: the choice of objective function to be optimized
during training, and the method of optimizing it. We find that both are important in unsupervised
learning. Our best results on most of the six languages involve both improved
objectives and improved search.
The methods presented in this thesis were originally presented in Smith and Eisner
(2004, 2005a,b, 2006). The thesis gives a more thorough exposition, relates the methods
to other work, presents more experimental results and error analysis, and directly
compares the methods to each other.