The Research On Text Extraction From Web Pages

Posted on:2011-01-08

Degree:Master

Type:Thesis

Country:China

Candidate:J Wan

Full Text:PDF

GTID:2178360302988339

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the World Wide Web has become a huge information space that provides valuable resources. However, the main text of a page is usually surrounded by a large amount of noise that has nothing to do with the text and that interferes with browsing. Extracting the text of a web page effectively while avoiding irrelevant information is therefore of great interest for Web data cleaning, automatic summarization, document classification, and many applications in other fields. A template-based text extraction method is therefore more effective. This thesis first analyzes the DSE (Data-rich Sub-tree Extraction) algorithm and then proposes an improved text extraction method based on both the structure and the content of the page: DTE (DOM-Based Text Extraction). It also designs and implements a prototype web page text extraction system and applies the DTE algorithm to that system.

Many existing algorithms attend only to the structure of a web page or only to its content. The DTE algorithm remedies this defect by considering content and structure together to obtain precise web page templates. It first parses the web pages into DOM trees and then compares and matches the nodes of those trees. In this way it identifies the noise nodes and the text nodes and obtains a precise page template of the information nodes. When a new web page arrives, most of the noise is removed by applying the page template. Semantic relations are then used to decide whether a mixed node is noise or text. The thesis also uses algorithms to locate the comments section of a page as part of the template, and uses semantic distance to judge whether pictures are part of the text, so that pictures and information tables that belong to the text are kept as they are.

Experimental results show that the system achieves a higher accuracy and completeness rate and extracts text from web pages effectively.
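
The template step described in the abstract (parse pages into trees, match nodes across pages, treat the shared nodes as noise, then strip them from a new page) can be illustrated with a minimal sketch. This is not the DTE algorithm itself, whose details are not given here. It assumes that a node counts as template noise when the same tag path and the same text occur in every sample page, and it leaves out the semantic-distance, comment-section, and picture handling. The names `text_nodes`, `build_template`, and `extract_text` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if "script" in self.stack or "style" in self.stack:
            return  # code and CSS are never page text
        text = " ".join(data.split())
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return the (path, text) pairs of one page."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def build_template(pages):
    """Assumed noise rule: pairs that occur in every sample page.

    Needs at least two sample pages to be meaningful.
    """
    return set.intersection(*(set(text_nodes(p)) for p in pages))


def extract_text(html, template):
    """Keep the text nodes of a new page that the template does not cover."""
    return [text for path, text in text_nodes(html)
            if (path, text) not in template]


def make_page(body):
    # Toy pages that share a navigation bar and a footer.
    return ("<html><body><div id='nav'>Home | About</div>"
            "<div><p>" + body + "</p></div>"
            "<div>Copyright 2010</div></body></html>")


template = build_template([make_page("First article body."),
                           make_page("Second article body.")])
print(extract_text(make_page("Third article body."), template))
# prints ['Third article body.']
```

Matching on exact text is the simplest possible content comparison. A method that weighs structure and content together, as the abstract describes, would need a softer similarity measure for nodes whose text varies slightly between pages.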

Keywords/Search Tags:

text, information extraction, HTML, DOM tree

Related items

1	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
2	Based On The Html Pages Of Web Information Extraction
3	The Research On Web Information Extraction Based On HMM
4	Research And Application On The Technology Of Web Information Extraction Based On The HTML
5	Extraction Technology Research, Based On Ontology Can Be Customized Web Information Intelligence
6	Research On The HTML And PDF Information Extraction Technology Based XML
7	The Technology Of Web Information Extraction Based On HTML Parser
8	The Literature Information Retrieval And Matching From The Web
9	Web Information Extraction Technology Applied Research, Competitive Intelligence Platform In The Enterprise
10	Pattern-Based Information Extraction From HTML Documents