Data Fusion via Manifold Matching

Date
2013-10-23
Publisher
Johns Hopkins University
Abstract
Data fusion, the integration of information from disparate sources, is a challenging research topic of burgeoning interest across many areas. This dissertation, situated in the general area of machine learning, pursues one important and timely aspect of the data fusion problem: manifold matching. Manifold matching identifies embeddings of multiple disparate data spaces into a common low-dimensional space, where joint inference can be pursued. Three manifold matching methods are considered: Procrustes ∘ Multidimensional Scaling (Procrustes ∘ MDS), Canonical Correlation Analysis ∘ Multidimensional Scaling (CCA ∘ MDS), and Joint Optimization of Fidelity and Commensurability (JOFC).

We apply manifold matching to two inference tasks. The first is cross-language text classification, in which documents are classified by a classifier trained on documents in a different language. The second is graph vertex nomination, in which the goal is to detect a subset of vertices possessing an attribute of interest in a given graph. Experimental results show that properly integrating information from additional spaces improves inference performance.

We are also interested in the inferential variability of Latent Dirichlet Allocation (LDA), a widely used topic model, under two inference methods: variational Expectation-Maximization (variational EM) and Gibbs sampling. Our method quantifies this variability using the adjusted Rand index and multidimensional scaling, yielding a d-dimensional representation in which each point represents one LDA repetition. We further demonstrate that the inferential variability inherent in LDA significantly affects the performance of the downstream graph vertex nomination task.
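To make the Procrustes ∘ MDS pipeline concrete, the following is a minimal sketch in Python. It assumes scikit-learn and SciPy; the two feature matrices, the Euclidean dissimilarity, and the embedding dimension are illustrative choices, not the dissertation's exact experimental setup. Each data space is embedded separately by MDS on its dissimilarity matrix, and an orthogonal Procrustes transformation then aligns one embedding to the other so that joint inference can operate in the shared space.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Two disparate views of the same n objects (hypothetical data,
# e.g., features of matched documents in two languages).
n = 100
latent = rng.normal(size=(n, 5))
view1 = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(n, 30))
view2 = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(n, 40))

def mds_embed(X, d=2, seed=0):
    """Embed one data space into R^d via metric MDS on its
    pairwise Euclidean dissimilarity matrix."""
    D = squareform(pdist(X))
    mds = MDS(n_components=d, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(D)

# MDS configurations are only determined up to rigid motion, so
# center both embeddings before the Procrustes alignment.
X1 = mds_embed(view1)
X2 = mds_embed(view2)
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

# Procrustes step: the orthogonal matrix R minimizing
# ||X1 @ R - X2||_F maps the first embedding onto the second.
R, _ = orthogonal_procrustes(X1, X2)
X1_aligned = X1 @ R

# After alignment the two embeddings live in the same space; a crude
# check of match quality is the mean distance between matched points.
print("mean matched-pair distance:",
      np.linalg.norm(X1_aligned - X2, axis=1).mean())
```

In the cross-language setting, a classifier trained on the aligned embedding of one language's documents can then be applied directly to the embedding of the other.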
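The LDA variability analysis admits an equally brief sketch, again hedged: this assumes scikit-learn's variational-EM LDA, a synthetic count corpus, and argmax-topic document clusterings as the input to the adjusted Rand index; the Gibbs sampling comparison from the dissertation is not shown. Each LDA repetition becomes one point in a d-dimensional MDS configuration, and the spread of those points visualizes the inferential variability.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import MDS
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Hypothetical corpus: document-term count matrix (200 docs, 500 terms).
X = rng.poisson(lam=1.0, size=(200, 500))

# Repeat LDA with different random seeds; cluster each document by
# its most probable topic under the fitted model.
n_reps, n_topics = 10, 5
clusterings = []
for seed in range(n_reps):
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=seed)  # variational EM
    doc_topic = lda.fit_transform(X)
    clusterings.append(doc_topic.argmax(axis=1))

# Dissimilarity between two repetitions: 1 - adjusted Rand index of
# their induced document clusterings.
D = np.zeros((n_reps, n_reps))
for i in range(n_reps):
    for j in range(i + 1, n_reps):
        D[i, j] = D[j, i] = 1.0 - adjusted_rand_score(clusterings[i],
                                                      clusterings[j])

# MDS yields a d-dimensional configuration with one point per LDA
# repetition; a tight cluster indicates low inferential variability.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
print(coords)
```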
Keywords
manifold matching, cross-language text classification, graph vertex nomination, inferential variability