Research On The Method Of Constructing Chinese And Vietnamese Comparable Corpus Based On

Posted on: 2016-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Luo
Full Text: PDF
GTID: 2208330470970570
Subject: Computer technology
Abstract/Summary:
Constructing comparable corpora is an active and difficult problem in natural language processing, and such corpora are an important basic resource for statistical machine translation and cross-language information retrieval. In recent years, as China has opened up to and developed ties with Southeast Asia, political, cultural, and economic exchanges between China and Vietnam have become increasingly frequent. However, the language barrier between the two countries has become a stumbling block to cooperation, so Chinese-Vietnamese language information processing has become increasingly important. At present, research on Chinese-Vietnamese bilingual information processing concentrates mainly on lexical, syntactic, and semantic analysis of Vietnamese or Chinese individually. Work aimed specifically at the language pair, such as bilingual corpus construction and bilingual machine translation, is still relatively scarce. Some researchers have studied mining bilingual parallel corpora from the web. However, because web page formats are diverse and complex and there are few Chinese-Vietnamese bilingual websites, it is difficult to obtain high-quality parallel corpora of substantial scale from the web.
Constructing a large parallel corpus manually, on the other hand, requires language experts who understand both Vietnamese and Chinese, and it is very time-consuming and costly. In response to these problems, and building on an analysis of existing work, this thesis studies methods for constructing a Chinese-Vietnamese comparable corpus. The main work covers the following aspects.

1) Web news extraction based on text block density. News websites differ widely in structure and layout, so the boundaries of page regions are hard to identify and existing web content extraction methods generalize poorly. This thesis proposes a web news extraction method based on text block density. It combines HTML language features with structural features to convert the page source into content blocks, yielding a title block and a text block. The news headline is extracted from the title block with a rule-based approach. The text block is then converted into a sequence of text lines, adjacent lines are merged into chunks according to the continuity of their content, and finally the news body is extracted by selecting the maximal series of sub-blocks. The method extracts news headlines and body text accurately and generalizes across sites.

2) Key-event topic sentence extraction based on TextRank. Atomic events are too fine-grained to be practical as an extraction unit, topics are too coarse to be accurate, and existing methods do not handle event information adequately. This thesis proposes extracting the key events of a news article with TextRank. The news text is split into atomic events, with the sentence as the unit. The TextRank algorithm is used to compute the similarity between atomic events, and the top N sentences (N is a fixed value) are selected according to their similarity-based weights. These sentences are taken as the topic sentences of the key news events. The method extracts key-event topic sentences accurately.

3) Cross-language news text similarity based on a vector space. When cross-language news texts are retrieved and matched, the matched topics can deviate from each other, which makes the matching results of the comparable corpus less accurate. This thesis proposes a method for constructing a comparable corpus based on vector space similarity. The core of the approach is to train a vector space model on a bilingual parallel corpus and obtain a bilingual word vector space, so that no bilingual dictionary or translation glossary is needed, and to integrate a time similarity calculation with a news event similarity calculation.
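As a rough illustration of the text-block-density extraction in item 1), the sketch below strips a page down to text lines, merges adjacent dense lines into chunks, and keeps the chunk with the most text. It is not the thesis's implementation: the function name extract_news, the thresholds min_chars and max_gap, the h1/title headline rule, and the regex-based tag handling are all assumptions made for this example.

```python
import re
from html import unescape


def extract_news(html_source, min_chars=20, max_gap=1):
    """Return (title, body) using a simple line-density heuristic (illustrative only)."""
    # Assumed headline rule: prefer the first <h1>, fall back to <title>.
    title = ""
    for tag in ("h1", "title"):
        m = re.search(rf"<{tag}[^>]*>(.*?)</{tag}>", html_source, re.I | re.S)
        if m:
            title = unescape(re.sub(r"<[^>]+>", "", m.group(1))).strip()
            break

    # Drop non-content elements, then turn block-level tags into line breaks.
    text = re.sub(r"<(script|style|head)\b.*?</\1>", "", html_source,
                  flags=re.I | re.S)
    text = re.sub(r"</?(p|div|br|li|tr|h[1-6])\b[^>]*>", "\n", text, flags=re.I)
    text = re.sub(r"<[^>]+>", "", text)
    lines = [unescape(line).strip() for line in text.split("\n")]
    lines = [line for line in lines if line]

    # Merge adjacent dense lines into chunks; at most `max_gap` sparse
    # (short) lines may sit between two dense lines of the same chunk.
    chunks, current, gap = [], [], 0
    for line in lines:
        if len(line) >= min_chars:
            current.append(line)
            gap = 0
        elif current:
            gap += 1
            if gap > max_gap:
                chunks.append(current)
                current, gap = [], 0
    if current:
        chunks.append(current)

    # Keep the chunk that carries the most text.
    body = max(chunks, key=lambda c: sum(len(ln) for ln in c), default=[])
    return title, "\n".join(body)
```

Short navigation and footer lines fall below min_chars and never start a chunk, which is why a length threshold stands in for "density" here. A production extractor would more likely use a real HTML parser than regular expressions.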
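The TextRank step in item 2) can be sketched as follows: sentences are the graph nodes, edge weights are a word-overlap similarity, and the top N sentences by score are returned. This is a minimal sketch under assumptions, not the thesis's code. The overlap similarity follows the commonly cited TextRank formulation, and the regex tokenizer assumes whitespace-separated words, so Chinese text would need a word segmenter first. The names split_sentences, similarity, and top_sentences are hypothetical.

```python
import math
import re


def split_sentences(text):
    # Split after Latin or CJK sentence-ending punctuation (Python 3.7+).
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def similarity(a, b):
    # Word overlap normalized by sentence lengths; very short sentences
    # are skipped so the denominator is never zero.
    if len(a) < 2 or len(b) < 2:
        return 0.0
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def top_sentences(text, n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    k = len(sentences)

    # Weighted, undirected sentence graph with no self-loops.
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(k)] for i in range(k)]
    out = [sum(row) for row in w]

    # Fixed number of power-iteration steps of the TextRank update.
    scores = [1.0] * k
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out[j] * scores[j] for j in range(k) if out[j] > 0
            )
            for i in range(k)
        ]

    # Pick the n best sentences, then restore document order.
    best = sorted(range(k), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]
```

Sentences that share vocabulary with many others accumulate score, so an off-topic sentence in an otherwise coherent article tends to rank last.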
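For item 3), the sketch below shows one way a shared bilingual word vector space could be used to score a Chinese-Vietnamese news pair and how a time term might be folded in. The abstract does not give the combination formula, so the weighted sum with alpha, the exponential time decay with scale_days, the averaged document vectors, and the hand-made TOY_VECTORS table are all assumptions for illustration. In the thesis the vectors are learned from a bilingual parallel corpus.

```python
import math
from datetime import date

# Hypothetical toy vectors standing in for a learned bilingual space.
TOY_VECTORS = {
    "贸易": (0.90, 0.10, 0.00), "thương_mại": (0.88, 0.12, 0.02),
    "边境": (0.10, 0.90, 0.00), "biên_giới": (0.12, 0.85, 0.05),
    "足球": (0.00, 0.10, 0.90), "bóng_đá": (0.02, 0.08, 0.92),
}


def doc_vector(tokens, vectors):
    # Average the vectors of known tokens; None if nothing is known.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def time_similarity(d1, d2, scale_days=3.0):
    # Assumed decay: 1.0 on the same day, falling off with the gap in days.
    return math.exp(-abs((d1 - d2).days) / scale_days)


def news_similarity(zh_tokens, zh_date, vi_tokens, vi_date, vectors, alpha=0.7):
    # Assumed combination: weighted sum of content and time similarity.
    u = doc_vector(zh_tokens, vectors)
    v = doc_vector(vi_tokens, vectors)
    if u is None or v is None:
        return 0.0
    content = cosine(u, v)
    return alpha * content + (1 - alpha) * time_similarity(zh_date, vi_date)
```

Pairs whose combined score exceeds a chosen threshold would be kept as comparable documents. Because both languages live in one vector space, no dictionary lookup is needed at scoring time.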
Keywords/Search Tags: web mining, comparable corpus, web news extraction, topic sentence extraction, cross-language text similarity