
Research On Bilingual Topic Model And Its Algorithm In Cross-language Information Retrieval

Posted on: 2014-02-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y S Luo
GTID: 1228330398992844
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet and the acceleration of globalization, information resources on the Internet are no longer limited to English and other common languages. The need to search for information in a non-native language is increasing. Because the Internet hosts multilingual resources and many users are not proficient in non-native languages, users inevitably face language barriers. Cross-language information retrieval (CLIR) is an effective way to represent, store, organize and access multilingual information. It is a challenging and cutting-edge field in information retrieval (IR).

Cross-language information retrieval addresses the search problem of retrieving documents in one language by querying in another language. The key to the problem is how to build the semantic relationship between the query in the source language and the documents in the target language. The topic model has become an effective method in CLIR, and in recent years it has drawn the attention of researchers in machine learning, information retrieval, natural language processing and related fields. The thesis focuses on a CLIR model, a cross-language text categorization (CLTC) method and a cross-language text clustering (or multi-language text clustering) method based on bilingual topics. These models and methods can effectively address the problem of words with multiple meanings in translation and partly solve the problem of translating unknown words. The main research findings of this thesis can be summarized as follows:

(1) A CLIR framework based on bilingual topic space
Natural language is regarded as strings of meaningful symbols that describe semantic objects in the real world. Multilingual texts are multiple views of the same object, and the views are semantically equivalent. Based on the assumption that the topics in a parallel text share the same semantic meanings across languages, the topics are sampled from the same topic-document distribution. We propose a CLIR framework based on bilingual topic space.
In the framework, the semantic meanings shared by parallel documents are extracted using the partial least squares (PLS) method, and a topic space is built to model the semantic relationship across languages.

The topic space for each language consists of the topics extracted from the bilingual parallel corpus. Each topic space is independent. The bilingual topic space models the semantic relationship between languages. The space is an abstract concept space. It reveals the semantic correspondences between documents, between documents and terms, and between terms. It also uncovers the inherent structure and internal relations of the corpus. Mathematically, the two topic spaces are approximately equivalent. Cross-language information retrieval, cross-language text classification and cross-language text clustering can be conducted without direct translation or a bilingual dictionary once the query or document is projected onto the bilingual topic space.

(2) Construction of a Chinese-English parallel corpus for CLIR
A corpus is an important basic data resource for CLIR. It is used for evaluation, translation and the construction of bilingual dictionaries for CLIR.

We collected bilingual news stories from the websites of the Wall Street Journal, the Financial Times and Hong Kong government news to construct a CLIR evaluation corpus, a bilingual parallel corpus and a CLTC evaluation corpus. The steps for constructing the corpora include selecting parallel webpages, preprocessing documents, aligning passages, labeling document classes, building the query set and judging document relevance. The TREC-9 document set for CLIR was translated by a Google API 1.0 interface program to create the TREC-9 bilingual parallel corpus.

(3) A CLIR model based on topic dual space
In the cross-language latent semantic indexing model (CL-LSI), each pair of documents is concatenated into a dual document, and the semantic relationship between languages is captured by exploiting the co-occurrence of terms across languages.
However, mixing the documents does not fully consider the inherent features of each language or the semantic correlation across languages. Based on the assumption that parallel documents share the same topics, we present a method to represent the bilingual topic space using a linear latent semantic dual space. The two topic spaces in the bilingual topic space are linear function spaces and are dual to each other. Each pair of topics is semantically independent. We therefore propose a topic dual space (TDS) model for CLIR. The TDS model can capture the co-occurrence of terms in parallel documents and build statistical dependencies.

Experiments on a self-designed bilingual corpus demonstrate that the TDS model can find 97.00% of translated counterparts and correctly translated words. Experimental results on an in-house dataset indicate that TDS outperforms CL-LSI in mate search and cross-language information retrieval. TDS is a language-independent model for mono- and cross-lingual retrieval, and it can extract bilingual topics that have thematic characteristics and reflect bilingual semantic relationships. Evaluations on the bilingual corpora TREC-5&6 and TREC-9 show that our model outperforms CL-LSI in mono- and cross-lingual retrieval tasks.

(4) A CLIR model based on bilingual topic correlation
How to extract cross-language semantic meaning from bilingual parallel documents is important for improving CLIR. In the TDS model, the matrices for the two languages are treated as being in a predictive relationship; they are asymmetric and are not treated equally. The model's time and space complexity are proportional to the number of documents. Therefore, the TDS model cannot effectively process large-scale document sets. Bilingual parallel documents share the same topics, which are semantically correlated. We propose a new bilingual topic correlation model (BiTC) for CLIR. The model views the parallel documents as two different language representations of the same semantic content and builds a single topic space for each language from the bilingual parallel corpus.
Cross-lingual information retrieval is conducted in these new topic spaces. The new model overcomes the deficiency of CL-LSI, which does not fully take into account the bilingual semantic relationship.

Experimental results on the aligned Chinese-English news collection show that BiTC significantly outperforms CL-LSI in mate search and cross-lingual pseudo-query retrieval, and that it performs better on the TREC-9 bilingual parallel corpus translated by Google Translation.

(5) A cross-lingual text categorization/clustering method based on bilingual semantic corresponding analysis
Bilingual text corresponding analysis can help to bridge the language barrier in cross-lingual corpora. Corpus-based cross-lingual latent semantic indexing does not fully take into account the bilingual semantic relationship. The paper proposes a new method that builds the semantic relationship of bilingual parallel documents via partial least squares. In this method, the parallel documents are viewed as two different language representations of the same semantic content, so that a unified latent semantic space can be constructed for the two languages. Cross-lingual text categorization is performed in the new bilingual latent semantic spaces.

The Chinese-English document-aligned evaluation dataset was collected from the Hong Kong government news website. Experimental results on mono- and cross-lingual classification show that the performance of the presented method exceeds or approaches that of monolingual classification in the original feature spaces.

The contributions of the thesis can be summarized as follows.
(1) We propose a CLIR model based on the topic dual space (TDS) model. The model uses a linear semantic dual space to construct the bilingual topic space, addressing the problem that each pair of documents is concatenated into a dual document in CL-LSI.
The TDS model can capture the co-occurrence of terms in parallel documents and build statistical dependencies for translation and query expansion.
(2) We present a bilingual topic correlation (BiTC) model for CLIR. It assumes that bilingual parallel documents share semantically correlated topics. The BiTC model constructs a single topic space for each language from the bilingual parallel corpus to build the bilingual semantic relationship. The new model addresses the problems of not fully considering the bilingual semantic relationship in CL-LSI and of not processing large-scale data effectively.
(3) We propose a cross-lingual text categorization/clustering method based on bilingual semantic corresponding analysis (BiSCAN). To address the problem that CL-LSI does not fully consider multiple correlations and structural information, BiSCAN constructs a single low-dimensional topic space for each language and builds the bilingual semantic correspondence. The performance of CLTC and MLDC using BiSCAN exceeds or approaches that of monolingual classification in the original feature spaces.
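
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea it describes: learn one topic space per language from parallel documents with a partial least squares step, project both sides, and evaluate by mate search. It uses the PLS-SVD variant (an SVD of the cross-covariance between the two term-document matrices), which is an assumption and not necessarily the TDS, BiTC or BiSCAN formulation. The function names `fit_pls_svd` and `mate_search_accuracy`, the matrix sizes and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_pls_svd(X_src, X_tgt, k):
    """Learn k paired projection directions from parallel documents.

    X_src: (n_docs, n_src_terms), X_tgt: (n_docs, n_tgt_terms);
    row i of each matrix describes the same document in two languages.
    """
    mu_s, mu_t = X_src.mean(axis=0), X_tgt.mean(axis=0)
    # Cross-covariance (up to a constant factor) between the two vocabularies.
    C = (X_src - mu_s).T @ (X_tgt - mu_t)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    # Columns of U / V are the source / target "topic" directions.
    return U[:, :k], Vt[:k].T, s[:k], mu_s, mu_t

def mate_search_accuracy(T_src, T_tgt):
    """Fraction of source documents whose nearest target (cosine) is its mate."""
    a = T_src / np.linalg.norm(T_src, axis=1, keepdims=True)
    b = T_tgt / np.linalg.norm(T_tgt, axis=1, keepdims=True)
    best = np.argmax(a @ b.T, axis=1)
    return float(np.mean(best == np.arange(len(a))))

# Synthetic parallel corpus (assumption): both languages are noisy linear
# views of the same latent topic mixtures Z, with orthonormal loadings.
rng = np.random.default_rng(0)
n_docs, k = 40, 3
Z = rng.normal(size=(n_docs, k))
A = np.linalg.qr(rng.normal(size=(12, k)))[0].T  # (k, 12) source loadings
B = np.linalg.qr(rng.normal(size=(10, k)))[0].T  # (k, 10) target loadings
X_src = Z @ A + 0.01 * rng.normal(size=(n_docs, 12))
X_tgt = Z @ B + 0.01 * rng.normal(size=(n_docs, 10))

W_src, W_tgt, sigma, mu_s, mu_t = fit_pls_svd(X_src, X_tgt, k)
T_src = (X_src - mu_s) @ W_src  # source documents in the source topic space
T_tgt = (X_tgt - mu_t) @ W_tgt  # target documents in the target topic space
accuracy = mate_search_accuracy(T_src, T_tgt)
print("mate search accuracy:", accuracy)
```

In this sketch a query in the source language would be centered with `mu_s`, projected with `W_src`, and compared by cosine similarity against target documents projected with `W_tgt`, so no dictionary or translation step is involved at retrieval time. Real term-document matrices are sparse and far larger, so a truncated or iterative SVD would likely be needed in practice.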
Keywords/Search Tags: cross-language information retrieval, cross-language text categorization, cross-language text clustering, topic model, bilingual topic model, partial least squares