Webpage Text Extraction And Bilingual Website Detection Based On Multi-feature Fusion

Posted on: 2015-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Li
Full Text: PDF
GTID: 2298330422990881
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the volume of information online has grown exponentially, but its quality is uneven, which makes accurate, fast, and comprehensive access to information increasingly difficult. Information extraction is therefore receiving close attention, and the accumulation of information presents both new opportunities and new challenges for information extraction technology. With the rapid development of natural language processing, machine translation has become more and more practical in everyday life: Youdao-Online-Translation, Google-Translation, Baidu-Translate, and related products have become important tools for non-professionals who study and work with foreign languages.
Bilingual corpora are the foundation of machine translation and supply important data for training, testing, and analyzing machine translation models. The quantity and quality of a bilingual corpus directly affect the results of parameter training and, in turn, largely determine the performance of machine translation products. Building a large, high-quality bilingual corpus therefore has great practical value for products and academic significance for machine translation, natural language processing, and related fields.
This thesis focuses on the architecture and implementation of a high-performance, efficient bilingual text extraction system. This system is a subsystem of a complete bilingual corpus extraction system; the Internet crawling system and sentence alignment are not covered.
The main content of this thesis covers two aspects: webpage text extraction and bilingual webpage detection.
For webpage text extraction, this thesis uses multi-feature fusion. Unlike the traditional approach of building a DOM tree, it processes a webpage with a linear reconstruction method based on container tags, which simplifies the data structure the algorithm needs from tree operations to operations on a linear list. It then combines features such as length, word segmentation results, and the number of sentences to find the article structure, and clusters the webpage text based on information gain.
For bilingual webpage detection, this thesis uses a text-to-text translation rate, computed from local sentence anchors between one sentence and a few sentences at the same position. On this basis, it adds named-entity overlap, the longest common subsequence ratio, and the pronoun ratio as auxiliary features for determining the article structure. When the system processes many webpages from the same website, it automatically generates a template for that website.
The webpage text extraction and bilingual webpage detection system reached state-of-the-art performance in the field. The thirty million bilingual sentences generated by this system together with the subsequent processing system were evaluated by the Institute of Software at the Heilongjiang Electronic Information Products Supervision Inspection Testing Center, and their precision was above 95%. The experimental results verify that the proposed feature fusion method is effective for mining bilingual corpora.
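One of the auxiliary features named above, the longest common subsequence ratio, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the normalization used here (LCS length divided by the length of the longer string, computed over characters) is an assumption, and the function names `lcs_length` and `lcs_ratio` are hypothetical rather than taken from the thesis.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only the previous row,
    so memory use is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length normalized by the longer input (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return lcs_length(a, b) / longest


if __name__ == "__main__":
    # Digits and other shared tokens tend to survive translation,
    # so a sentence pair with matching numbers scores higher.
    print(lcs_ratio("2015-11-24", "2015年11月24日"))
```

Under this assumed definition, text in different scripts shares few characters, so the ratio is driven mostly by numbers, URLs, and untranslated names. That would make it a plausible supporting signal alongside the translation rate rather than a standalone test.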
Keywords/Search Tags: machine translation, bilingual corpus, webpage text extraction, features, linearization