Web Topic Information Extraction System Design And Implementation

Posted on: 2013-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Wu
Full Text: PDF
GTID: 2268330392469553
Subject: Software engineering
Abstract/Summary:
With the explosive growth of information on the Internet, the Internet has become a very important information source in people’s daily lives. Because of the sheer volume, looking information up manually is becoming more and more difficult, so the search engine has become an irreplaceable tool. A search engine finds information by means of information. The first step in using information is to understand it, yet the information a search engine works with is Web pages, which contain noise. Web page information extraction is therefore a focus of research in the search engine field.

This thesis presents a general method for extracting topic information from Web pages. Pages on the Web today do not have a formal structure. The thesis first performs Web page pre-processing, which recognizes the page’s file type, handles encoding problems, extracts scripts and purifies the page. Because existing Web topic information extraction systems do not use the structure and visual features of the page, the thesis proposes a topic extraction method that uses visual information and semantic features. The algorithm first converts the page from its unstructured form into a structured DOM tree. At the same time it extracts the CSS and renders the DOM tree, so that the tree carries visual information. Next, it segments the page with the VIPS algorithm, which turns the DOM tree into a layered content tree with semantic features. The content blocks are then clustered into categories, producing a set of clusters. Finally, each content block is scored using its structural and semantic features, and the topic information is extracted and output according to a pre-defined score threshold.

In the experiment on Chinese news pages, the accuracy measure F is 0.93. When the algorithm is applied to topic extraction from ordinary Chinese Web pages, the score is 0.84. The thesis therefore concludes that the algorithm can satisfy the application requirements...
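
The final step of the abstract, scoring each content block and keeping those above a pre-defined threshold, can be illustrated with a minimal Python sketch. This is not the thesis's scoring function, which the abstract does not give. The ContentBlock fields, the weights, the 200-character saturation point and the 0.5 threshold are all assumptions chosen for illustration, and the sketch takes already-segmented blocks as input rather than running VIPS or clustering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentBlock:
    """One visual block of a page (hypothetical structure, not from the thesis)."""
    text: str          # visible text inside the block
    link_chars: int    # characters of that text that sit inside links
    area_ratio: float  # share of the rendered page area the block covers, 0..1


def score_block(block: ContentBlock) -> float:
    """Toy score: long, link-poor, visually large blocks score higher.

    The formula and weights are illustrative assumptions.
    """
    text_len = len(block.text)
    if text_len == 0:
        return 0.0
    link_density = min(block.link_chars / text_len, 1.0)
    length_score = min(text_len / 200.0, 1.0)  # saturates at 200 characters
    return (1.0 - link_density) * (0.5 * length_score + 0.5 * block.area_ratio)


def extract_topic(blocks: List[ContentBlock], threshold: float = 0.5) -> List[str]:
    """Return the text of every block whose score reaches the threshold."""
    return [b.text for b in blocks if score_block(b) >= threshold]


# Example input: a navigation bar, an article body and a footer.
nav = ContentBlock(text="Home News Sports", link_chars=16, area_ratio=0.05)
article = ContentBlock(text="x" * 400, link_chars=0, area_ratio=0.6)
footer = ContentBlock(text="Copyright 2013", link_chars=0, area_ratio=0.02)
topic_texts = extract_topic([nav, article, footer])
```

With these example blocks only the article body passes the threshold: the navigation bar is penalized for consisting entirely of link text, and the footer for being short and small. A real system along the lines of the abstract would derive such features from the rendered DOM tree and the clustered content blocks.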
Keywords/Search Tags: web topic information extraction, web pre-processing, web page parser, VIPS algorithm, web block clustering