Data Fusion via Manifold Matching

Date
2013-10-23
Publisher
Johns Hopkins University
Abstract
Data fusion, the integration of information from disparate sources, is a challenging research topic of burgeoning interest across many areas. This dissertation, situated in the general area of machine learning, pursues one important and timely aspect of the data fusion problem: manifold matching. Manifold matching identifies embeddings of multiple disparate data spaces into a common low-dimensional space, where joint inference can be pursued. Three manifold matching methods are considered: Procrustes ∘ Multidimensional Scaling (Procrustes ∘ MDS), Canonical Correlation Analysis ∘ Multidimensional Scaling (CCA ∘ MDS), and Joint Optimization of Fidelity and Commensurability (JOFC).

We apply manifold matching to two inference tasks. The first is cross-language text classification, in which documents are classified by a classifier trained on documents in a different language. The second is graph vertex nomination, in which the goal is to detect a subset of vertices possessing an attribute of interest in a given graph. Experimental results show that properly integrating information from additional spaces improves inference performance.

We are also interested in the inferential variability of Latent Dirichlet Allocation (LDA), a widely used topic model, under two inference methods: variational Expectation-Maximization (variational EM) and Gibbs sampling. Our method quantifies this variability using the adjusted Rand index and multidimensional scaling, yielding a d-dimensional representation in which each point represents one LDA repetition. We further demonstrate that the inferential variability inherent in LDA significantly affects the performance of the downstream graph vertex nomination task.
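To make the Procrustes ∘ MDS pipeline concrete, the following is a minimal sketch in Python. It assumes scikit-learn and SciPy; the two feature matrices, the Euclidean dissimilarity, and the embedding dimension are illustrative choices, not the dissertation's exact experimental setup. Each data space is embedded separately by MDS on its dissimilarity matrix, and an orthogonal Procrustes transformation then aligns one embedding to the other so that joint inference can operate in the shared space.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Two disparate views of the same n objects (hypothetical data,
# e.g., features of matched documents in two languages).
n = 100
latent = rng.normal(size=(n, 5))
view1 = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(n, 30))
view2 = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(n, 40))

def mds_embed(X, d=2, seed=0):
    """Embed one data space into R^d via metric MDS on its
    pairwise Euclidean dissimilarity matrix."""
    D = squareform(pdist(X))
    mds = MDS(n_components=d, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(D)

# MDS configurations are only determined up to rigid motion, so
# center both embeddings before the Procrustes alignment.
X1 = mds_embed(view1)
X2 = mds_embed(view2)
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

# Procrustes step: the orthogonal matrix R minimizing
# ||X1 @ R - X2||_F maps the first embedding onto the second.
R, _ = orthogonal_procrustes(X1, X2)
X1_aligned = X1 @ R

# After alignment the two embeddings live in the same space; a crude
# check of match quality is the mean distance between matched points.
print("mean matched-pair distance:",
      np.linalg.norm(X1_aligned - X2, axis=1).mean())
```

In the cross-language setting, a classifier trained on the aligned embedding of one language's documents can then be applied directly to the embedding of the other.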
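The LDA variability analysis admits an equally brief sketch, again hedged: this assumes scikit-learn's variational-EM LDA, a synthetic count corpus, and argmax-topic document clusterings as the input to the adjusted Rand index; the Gibbs sampling comparison from the dissertation is not shown. Each LDA repetition becomes one point in a d-dimensional MDS configuration, and the spread of those points visualizes the inferential variability.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import MDS
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Hypothetical corpus: document-term count matrix (200 docs, 500 terms).
X = rng.poisson(lam=1.0, size=(200, 500))

# Repeat LDA with different random seeds; cluster each document by
# its most probable topic under the fitted model.
n_reps, n_topics = 10, 5
clusterings = []
for seed in range(n_reps):
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=seed)  # variational EM
    doc_topic = lda.fit_transform(X)
    clusterings.append(doc_topic.argmax(axis=1))

# Dissimilarity between two repetitions: 1 - adjusted Rand index of
# their induced document clusterings.
D = np.zeros((n_reps, n_reps))
for i in range(n_reps):
    for j in range(i + 1, n_reps):
        D[i, j] = D[j, i] = 1.0 - adjusted_rand_score(clusterings[i],
                                                      clusterings[j])

# MDS yields a d-dimensional configuration with one point per LDA
# repetition; a tight cluster indicates low inferential variability.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
print(coords)
```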
Keywords
manifold matching, cross-language text classification, graph vertex nomination, inferential variability