
Research On Cross-language Text Classification Technology

Posted on: 2017-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Liu
Full Text: PDF
GTID: 2428330569999016
Subject: Management Science and Engineering
Abstract/Summary:
We live in an information age: the volume of information on the network is growing explosively, and the speed at which it spreads and the scale it reaches are unprecedented. Accessing information more widely, more quickly, and more accurately means being better placed to gain comprehensive knowledge, valuable information, and economic benefit. Text classification is one of the main methods of processing text information. It organizes and manages information or knowledge by category, helping people find what they need more quickly and accurately. However, when faced with large-scale data and multilingual text, existing text classification methods are inefficient and produce poor results, and they fall far short of users' needs. Cross-language text classification arose in response. As an effective means of organizing and managing multilingual texts, it overcomes the barriers between languages and lets users organize and locate the information they need more efficiently.

This thesis studies methods of cross-language text classification. It addresses four problems in the field: the lack of multilingual parallel corpora, the language barrier between texts in different languages, subject (theme) drift, and poor classification performance combined with low efficiency. First, a multilingual parallel corpus is constructed by machine translation and used as the experimental dataset. Second, the language barrier and subject drift are traced to differences in meaning, grammar, and cultural background between languages, which make it difficult to relate texts across languages and cause the theme of an article to shift during translation. To address these problems, the thesis introduces Word2Vec, a tool for training word vectors, into text representation. By taking semantic and context information into account and projecting words from different languages into the same vector space, it crosses the barriers between languages and thereby mitigates both the language barrier and subject drift. Third, the thesis proposes two new methods for cross-language text classification that improve both the efficiency and the effectiveness of classification. Finally, the thesis builds a cross-language text categorization system. The two methods are applied to cross-language classification on a Chinese-English-French parallel corpus and achieve good results and efficiency.
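As a rough illustration of why a shared vector space helps, the sketch below trains a classifier on English documents only and then applies it to French and Chinese documents. It is not either of the two methods proposed in the thesis, which the abstract does not describe. Everything in it is an assumption made for illustration: the vectors in `SHARED_VECTORS` are written by hand rather than trained with Word2Vec, documents are represented by averaging word vectors, and the classifier is a simple nearest-centroid rule with cosine similarity.

```python
import math

# Hypothetical 2-D vectors standing in for a shared cross-language space.
# In practice they would be learned (e.g. with Word2Vec), not written by hand.
SHARED_VECTORS = {
    "team": (0.90, 0.10), "player": (0.80, 0.20),    # English
    "market": (0.10, 0.90), "bank": (0.20, 0.80),
    "équipe": (0.88, 0.12), "joueur": (0.82, 0.18),  # French
    "marché": (0.12, 0.88), "banque": (0.18, 0.82),
    "球队": (0.90, 0.15), "银行": (0.15, 0.90),        # Chinese
}


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def doc_vector(tokens):
    """Represent a document as the average of its known word vectors."""
    known = [SHARED_VECTORS[t] for t in tokens if t in SHARED_VECTORS]
    if not known:
        raise ValueError("no token of this document is in the vocabulary")
    return mean_vector(known)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def train_centroids(labeled_docs):
    """Compute one centroid per label from (tokens, label) pairs."""
    by_label = {}
    for tokens, label in labeled_docs:
        by_label.setdefault(label, []).append(doc_vector(tokens))
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}


def classify(tokens, centroids):
    """Return the label whose centroid is most similar to the document."""
    vec = doc_vector(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Train on English only, then classify documents in other languages.
centroids = train_centroids([
    (["team", "player"], "sports"),
    (["market", "bank"], "finance"),
])
print(classify(["équipe", "joueur"], centroids))
print(classify(["银行"], centroids))
```

Because the toy vectors place translations close together, the English-trained centroids also separate the French and Chinese documents. With real data, the quality of this transfer would depend on how well the trained embeddings align across languages.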
Keywords/Search Tags: cross-language text classification, multilingual parallel corpus, language barrier, theme drift, Word2Vec