The Research On Cross Language Text Categorization Based On Interlingua Semantic

Posted on: 2009-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: W X Bi
GTID: 2178360272480755
Subject: Computer system architecture
Abstract/Summary:
With the development of the Internet, the network has become an important source of information, and the volume of information coming from governments, academia, and business is growing rapidly. These resources form a multilingual knowledge base, yet people are accustomed to querying in their native language, so they can understand only a small fraction of the available information. Because information is multilingual and the number of languages any one person has mastered is limited, language becomes one of the barriers to obtaining and using information.

Cross Language Text Categorization (CLTC) has emerged as one of the most effective text information management methods: it overcomes the language barrier and helps people manage multilingual texts more quickly and easily.

Dictionary-based and machine translation techniques have been popular in CLTC. The dictionary-based method translates with a bilingual dictionary, but because context information is missing and words have more than one meaning, word sense disambiguation is difficult. In addition, no dictionary contains every word, so words such as personal names and place names cannot be found in the dictionary when they need to be translated. Moreover, the approach is costly and infeasible on large-scale corpora.

Latent Semantic Indexing (LSI) was introduced to CLTC as an approach that does not use translation technology. It is based on content concepts, but the complexity of the singular value decomposition (SVD) is still very high, and the value of k (the number of retained dimensions) has to be determined through repeated experiments.

To solve these problems, we present a new Cross Language Text Categorization model based on interlingua semantics, which builds a unified framework that extracts interlingua semantic pairs from a parallel bilingual corpus.
This thesis describes the principle of the model and the experimental results on how feature dimension and interlingua semantics influence the performance of the new CLTC model. In addition, we compare the new model with monolingual text categorization, and the experiments show that the new model performs well.

The main contributions of this thesis are: first, by extending the partial least squares (PLS) principle, we propose a new cross language text categorization model; second, we build several bilingual corpora, which lay a foundation for building bilingual corpora in the future.
Keywords/Search Tags: interlingua language, cross language text categorization, cross language information retrieval, partial least squares, latent semantic pairs
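
The abstract names partial least squares (PLS) as the basis of the proposed model but does not give its formulation. As a rough illustration only, the sketch below computes the first pair of PLS directions for a toy parallel corpus: one weight vector over English terms and one over Chinese terms, chosen so that the projections of paired documents have maximal covariance. Everything here is an assumption for illustration: the function name `first_pls_pair`, the toy term counts, and the use of plain power iteration are not taken from the thesis, and the thesis's own extension of PLS may differ.

```python
def center(M):
    """Subtract each column's mean (rows are documents, columns are terms)."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[x - m for x, m in zip(row, means)] for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def normalize(v):
    # Assumes v is not the zero vector (true for the toy data below).
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v]

def first_pls_pair(X, Y, iters=200):
    """Return (u, v, t, s): unit weight vectors u, v and document scores
    t = Xc u, s = Yc v, where u and v maximize the covariance of t and s.
    Power iteration on the cross-product matrix Xc^T Yc."""
    Xc, Yc = center(X), center(Y)
    Xt, Yt = transpose(Xc), transpose(Yc)
    v = normalize([1.0] + [0.0] * (len(Yc[0]) - 1))
    for _ in range(iters):
        u = normalize(matvec(Xt, matvec(Yc, v)))  # u proportional to Xc^T Yc v
        v = normalize(matvec(Yt, matvec(Xc, u)))  # v proportional to Yc^T Xc u
    return u, v, matvec(Xc, u), matvec(Yc, v)

# Hypothetical parallel corpus: row i of X_en and row i of Y_zh describe
# the same document. Documents 0-1 share one topic, documents 2-3 another.
X_en = [[2, 0, 1], [3, 0, 0], [0, 2, 0], [0, 3, 1]]  # counts of 3 English terms
Y_zh = [[1, 0], [2, 0], [0, 2], [0, 1]]              # counts of 2 Chinese terms

u, v, t_scores, s_scores = first_pls_pair(X_en, Y_zh)
print("English-side scores:", [round(x, 3) for x in t_scores])
print("Chinese-side scores:", [round(x, 3) for x in s_scores])
```

In this toy setting the two versions of each document land on the same side of zero along the shared direction, which is the property a classifier trained on one language and applied to the other would rely on. A practical system would extract several such pairs and would need real term weighting and a much larger parallel corpus.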