
A Study On The Analytical Method Of Chinese And Vietnamese Bilingual News

Posted on: 2016-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: J C Wang
Full Text: PDF
GTID: 2208330470970752
Subject: Computer application technology
Abstract/Summary:
Vietnam borders China's Yunnan Province, and communication between the two countries is increasing under the "bridgehead" strategy. Because text is the main carrier of information, bilingual Chinese-Vietnamese text processing has become particularly important. Wikipedia contains Chinese and Vietnamese pages rich in semantic information, which makes it an important resource for cross-language text processing. This thesis studies the understanding of Chinese and Vietnamese bilingual texts in light of the textual differences between the two languages. It proposes a Wikipedia-based method for calculating semantic relatedness between Chinese and Vietnamese words, and a method for calculating Chinese-Vietnamese text similarity based on the bilingual topic distribution of words. Finally, building on cross-language text similarity, it analyzes news topics using entity co-occurrence information in online news text. The specific research contents are as follows:

1. Concepts in Wikipedia can characterize words. Making use of this property, the thesis uses Wikipedia concepts as a vector space in which to represent words. Each word in a Chinese-Vietnamese pair is represented by a high-dimensional vector. Each dimension corresponds to a Wikipedia concept in the respective language, and the value in that dimension is the weight of the word in the corresponding concept page. Concepts are mapped between Chinese and Vietnamese according to the Wikipedia multilingual list. The semantic relatedness of Chinese and Vietnamese words is obtained by calculating the similarity of the vectors produced with the Lesk algorithm.

2. Different texts have different topic distributions, and the similarity of those distributions can represent the degree of similarity between texts; text similarity is calculated on this basis. Pages for the same concept in the Chinese and Vietnamese Wikipedias can be used to train a bilingual topic model. The bilingual topic information obtained can then be used to predict the topic distribution of a bilingual document and to extract the topic frequency-inverse document frequency of a new document. The similarity of the resulting distributions is calculated with KL divergence and cosine similarity, and the similarity of the two distributions is taken as the similarity of the two texts.

3. The thesis analyzes the topics of Chinese and Vietnamese bilingual news containing key elements and topics. Chinese and Vietnamese news texts are fetched with a template-based method. Text similarity is calculated with the Chinese-Vietnamese text similarity method described above. Keywords are obtained from news titles through word segmentation and stop-word removal. Entity information is extracted from the bilingual news texts. The Wikipedia multilingual list is used to translate keywords and entity information into their equivalents. These features are then used as supervision when clustering Chinese and Vietnamese news texts, in order to analyze the implicit topics in the bilingual news.
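
The two similarity calculations described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The concept weights, the `links` dictionary standing in for the Wikipedia multilingual list, and the function names are all hypothetical, and the symmetrized form of KL divergence is an assumption, since the abstract does not say how KL divergence and cosine similarity are applied or combined.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def map_concepts(vec, links):
    """Re-index a concept vector through a cross-language link table.

    `links` is a hypothetical dict (Vietnamese concept -> Chinese concept);
    concepts without a counterpart are dropped.
    """
    mapped = {}
    for concept, weight in vec.items():
        if concept in links:
            target = links[concept]
            mapped[target] = mapped.get(target, 0.0) + weight
    return mapped

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) for two discrete distributions (lists)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Average of D(p || q) and D(q || p); the symmetrization is an assumption."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Word relatedness: made-up concept weights for one Chinese and one Vietnamese word.
zh_vec = {"越南": 0.8, "河内": 0.3}
vi_vec = {"Việt Nam": 0.6, "Hà Nội": 0.4, "Phở": 0.2}
links = {"Việt Nam": "越南", "Hà Nội": "河内"}
word_relatedness = cosine(zh_vec, map_concepts(vi_vec, links))

# Text similarity: made-up topic distributions for two documents over three topics.
doc_zh = [0.7, 0.2, 0.1]
doc_vi = [0.6, 0.3, 0.1]
topic_cosine = cosine(dict(enumerate(doc_zh)), dict(enumerate(doc_vi)))
topic_divergence = symmetric_kl(doc_zh, doc_vi)
```

A higher cosine value means the two vectors point in more similar directions, while a lower KL divergence means the two topic distributions are closer, so the two measures run in opposite directions and would need to be reconciled before being combined.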
Keywords/Search Tags: Wikipedia, Semantic Relatedness, Chinese and Vietnamese, Topic, Text similarity