The Research On Text Extraction From Web Pages

Posted on: 2011-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Wan
Full Text: PDF
GTID: 2178360302988339
Subject: Computer application technology
Abstract/Summary:
Along with the rapid development of the Internet, the WWW has become a huge information space providing valuable information resources. However, the main text of a page is usually surrounded by a large amount of noise information that has nothing to do with the text and interferes with users browsing the page. Research on extracting the text from web pages effectively while avoiding irrelevant information is therefore of great interest for Web data cleaning, automatic summarization, document classification, and many applications in other fields. For this task, a template-based text extraction method is more effective.

This paper first analyzes the DSE (Data-rich Sub-tree Extraction) algorithm and then proposes DTE (DOM-Based Text Extraction), an improved text extraction method based on both the structure and the content of the page. It also designs and implements a prototype web page text extraction system and applies the DTE algorithm to it.

Many existing algorithms attend only to the structure of a web page or only to its content. The DTE algorithm remedies this defect by considering content and structure together in order to obtain precise web page templates. It first parses the web pages into DOM trees, then compares and matches the nodes in the DOM trees. In this way it identifies the noise nodes and the text nodes and obtains a precise page template. When a new web page arrives, most of the noise information is removed using the page template. Semantic relations are then used to determine whether a mixed node is noise or text. The paper also uses algorithms to locate the comments section of a page as part of the template, and uses semantic distance to judge whether pictures are part of the text, so that the pictures and information tables that belong to the text are kept as they are.

Experimental results show that the system achieves higher accuracy and recall, and that it extracts text from web pages effectively.
Keywords/Search Tags: text, information extraction, HTML, DOM tree
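
The template idea described in the abstract (parse pages into DOM trees, match nodes across pages, treat the matching nodes as noise, then strip them from a new page) can be illustrated with a minimal Python sketch. This is not the thesis's DTE algorithm: the class and function names, the sample pages, and the rule "same DOM path and same text in two sample pages means template noise" are all assumptions made for illustration. The semantic-relation step for mixed nodes, the comment-section detection, and the picture handling are not modeled.

```python
from html.parser import HTMLParser

# Assumption: a small set of void tags is enough for this illustration.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}


class TextNodeCollector(HTMLParser):
    """Collect (DOM path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not (set(self.stack) & SKIP_TAGS):
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def build_template(page_a, page_b):
    """Hypothetical rule: nodes identical in both sample pages are noise."""
    return set(text_nodes(page_a)) & set(text_nodes(page_b))


def extract_text(page, template):
    """Keep the text nodes of a new page that the template does not cover."""
    return [text for path, text in text_nodes(page)
            if (path, text) not in template]


def make_page(title, body):
    # Made-up page layout shared by all sample pages.
    return f"""<html><body>
      <div><a>Home</a><a>News</a></div>
      <div><h1>{title}</h1><p>{body}</p></div>
      <div>Copyright 2011</div>
    </body></html>"""


PAGE_A = make_page("Title A", "Body of article A.")
PAGE_B = make_page("Title B", "Body of article B.")
PAGE_C = make_page("Title C", "Body of article C.")

template = build_template(PAGE_A, PAGE_B)
print(extract_text(PAGE_C, template))
```

Matching on the pair of DOM path and text is the simplest way to combine structure and content in one comparison. A node whose path matches the template but whose text differs (a mixed node) is kept here, whereas the abstract says DTE decides such cases using semantic relations.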