
Cross-language Text Classification Research Based On Latent Semantic Dual Space

Posted on: 2011-12-30
Degree: Master
Type: Thesis
Country: China
Candidate: C Xiong
Full Text: PDF
GTID: 2178330332965620
Subject: Computer application technology
Abstract/Summary:
According to statistics, the number of web pages on the Internet has already reached 100 million and is growing at a rate of millions of pages a day. Finding relevant information in an open Internet database while overcoming language barriers is becoming more and more difficult. Therefore, using automatic classification and information retrieval to handle large numbers of multilingual texts has become particularly important.

In large-scale text processing, classification is mainly used to organize text, especially vast text information resources. Organizing related texts by classification makes text processing easier and supports the discovery of useful new knowledge patterns. Web pages are now abundant in many languages, and cross-language text classification helps people share multilingual Internet information resources.

As information resources become increasingly multilingual, cross-language text classification technology continues to develop. At present, cross-language methods are mainly based on machine translation, bilingual dictionaries, and corpora [1]. Machine translation takes a long time, requires a large amount of computation, and increases the computing load. In addition, the quality of machine translation is still low, and its accuracy needs to be improved. Most studies are based on bilingual dictionaries and word selection methods [2]. The corpus-based approach uses a large-scale corpus, extracts the required information from it, and automatically constructs applications related to translation technology [3], in order to address ambiguity and insufficient word coverage. Therefore, corpus-based Latent Semantic Indexing [6] was introduced into cross-language classification. The results improve greatly, but a cross-language similarity matrix between words must be built, which is costly in both space and time.

This paper takes a corpus-based approach and uses statistical theory [7,8] to study a model of cross-language text classification based on a latent semantic dual space. The partial least squares method is used to extract latent semantics from the document feature matrix, thereby building the latent semantic dual space. The labeled corpus is projected into this space, and a classifier is then trained on the projected documents. This method represents multilingual documents by concepts, which avoids the ambiguity caused by translation and thus overcomes the language barrier.

In experiments that vary the training set size and the language composition, the results show that cross-language text classification based on the latent semantic dual space performs well in both stability and accuracy.
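
The pipeline described in the abstract (extract paired latent directions from a parallel corpus with partial least squares, project labeled documents into that space, train a classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the thesis's implementation: it uses the PLS-SVD variant (singular vectors of the cross-covariance between the two languages' term matrices, found by power iteration), a nearest-centroid classifier, and a tiny hypothetical English/German term-count corpus.

```python
import math

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def center(rows, means):
    return [[v - m for v, m in zip(row, means)] for row in rows]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0  # guard against a zero vector
    return [a / norm for a in v]

def pls_svd(X, Y, k, iters=200):
    """Paired latent directions (U for language A, V for language B)
    from aligned document-term matrices X and Y (one row per document pair)."""
    mx, my = column_means(X), column_means(Y)
    Xt, Yt = transpose(center(X, mx)), transpose(center(Y, my))
    # Cross-covariance (up to scale) between the two vocabularies: C = Xc^T Yc
    C = [[sum(a * b for a, b in zip(xc, yc)) for yc in Yt] for xc in Xt]
    U, V = [], []
    for _component in range(k):
        # Deterministic, asymmetric start so it is not orthogonal to the answer
        v = normalize([float(j + 1) for j in range(len(C[0]))])
        for _step in range(iters):
            u = normalize(matvec(C, v))
            v = normalize(matvec(transpose(C), u))
        sigma = sum(a * b for a, b in zip(u, matvec(C, v)))
        U.append(u)
        V.append(v)
        # Deflate: remove the component just found before extracting the next
        C = [[c - sigma * ui * vj for c, vj in zip(row, v)] for row, ui in zip(C, u)]
    return mx, my, U, V

def project(doc, means, directions):
    centered = [v - m for v, m in zip(doc, means)]
    return [sum(a * b for a, b in zip(centered, d)) for d in directions]

def train_centroids(vectors, labels):
    groups = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vec)
    return {label: column_means(vecs) for label, vecs in groups.items()}

def predict(vec, centroids):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical parallel corpus: row i of X and row i of Y are translations.
# English columns: goal, match, stock, market
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
# German columns: tor, spiel, aktie, markt
Y = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]

mean_en, mean_de, U, V = pls_svd(X, Y, k=2)

# Labeled training documents exist only in English.
train_docs = [[3, 1, 0, 0], [1, 3, 0, 0], [0, 0, 3, 1], [0, 0, 1, 3]]
train_labels = ["sports", "sports", "finance", "finance"]
centroids = train_centroids(
    [project(d, mean_en, U) for d in train_docs], train_labels
)

# German test documents are projected with V and classified in the shared space.
test_docs_de = [[1, 2, 0, 0], [0, 0, 2, 2]]
predictions = [predict(project(d, mean_de, V), centroids) for d in test_docs_de]
print(predictions)
```

Because each pair of directions is extracted jointly from aligned documents, English and German documents on the same topic land near each other in the latent space, so no word-level translation or cross-language word similarity matrix is needed at classification time.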
Keywords/Search Tags: cross-language text classification, latent semantic dual space, semantic pairs, parallel corpus