Design And Implementation Of Text Information Extracting Modules Of Html Web Pages Based On DOM

Posted on:2012-10-07

Degree:Master

Type:Thesis

Country:China

Candidate:X L Su

GTID:2218330338453042

Subject:Software engineering

Abstract/Summary:

Extracting the main text from HTML web pages is a basic task, and still an open problem, for many internet applications. The main "text" information of an HTML page is usually surrounded by "noise". A page typically shows two kinds of content. One is the text information of the page itself; for example, the resume section of a resume page is "text" information. The other is content such as navigation bars, advertisements and copyright notices, which has nothing to do with the text information of the page. We call this "noise" information. A large amount of noise makes it hard for users to find the subject information quickly. Extracting text information from web pages rapidly and accurately is therefore one of the key technologies affecting the service quality of internet applications.

A common method for extracting text information from HTML pages is inductive learning, in which extraction rules are learnt from given training pages. This method extracts text information accurately, but the rules must be learnt again whenever a website changes its template. As the number of templates grows, the extractor becomes more and more expensive to maintain and less flexible.

The approach of this thesis is based on the Document Object Model (DOM) specification. The HTML source is parsed into a complete DOM tree, and a program then traverses this tree. Each node is judged by its subject relevance and by the context of the node. This judgment determines which information to extract and which irrelevant information to delete, so that the final output contains only the text information of the HTML document.

Building on the extraction method of 'Content Extraction from Chinese Web Page Based on Statistics', this method adds a judgment on the context of nodes, which allows text information to be extracted more accurately. The method also does not rely on the template information of web pages, so it is a general approach to extracting text information. Finally, the experimental results confirm the accuracy and effectiveness of the method.
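
The DOM-based procedure described above (build a tree, traverse it, judge each node by its own content and by its context) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the relevance measure, so the sketch substitutes a simple text-length and link-density heuristic, and treats "context" as the neighbouring block under the same parent. All names (`Node`, `TreeBuilder`, `extract_text`) and the thresholds `min_chars` and `max_link_ratio` are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}            # never treated as text
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # elements without a close tag
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree (an assumption, not a full DOM)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.current.children.append(text)


def text_stats(node, in_link=False):
    """Return (total characters, characters inside <a>) for a subtree."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    in_link = in_link or node.tag == "a"
    total = linked = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            linked += len(child) if in_link else 0
        else:
            t, l = text_stats(child, in_link)
            total, linked = total + t, linked + l
    return total, linked


def node_text(node):
    if node.tag in SKIP_TAGS:
        return ""
    parts = [c if isinstance(c, str) else node_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def has_block_descendant(node):
    return any(isinstance(c, Node) and (c.tag in BLOCK_TAGS or has_block_descendant(c))
               for c in node.children)


def collect_leaf_blocks(node, out):
    """Collect block elements that contain no further block elements."""
    if node.tag in SKIP_TAGS:
        return
    if node.tag in BLOCK_TAGS and not has_block_descendant(node):
        out.append(node)
        return
    for child in node.children:
        if isinstance(child, Node):
            collect_leaf_blocks(child, out)


def extract_text(html, min_chars=40, max_link_ratio=0.3):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = []
    collect_leaf_blocks(builder.root, blocks)

    def low_link_density(n):
        total, linked = text_stats(n)
        return total > 0 and linked / total <= max_link_ratio

    def strong(n):  # judged as text on its own content
        return low_link_density(n) and text_stats(n)[0] >= min_chars

    kept = []
    for i, n in enumerate(blocks):
        # Context: adjacent leaf blocks that share the same parent node.
        neighbours = [b for b in blocks[max(i - 1, 0):i + 2]
                      if b is not n and b.parent is n.parent]
        if strong(n) or (low_link_density(n) and any(strong(b) for b in neighbours)):
            kept.append(node_text(n))
    return "\n".join(kept)
```

Calling `extract_text(page_source)` returns the retained blocks joined by newlines. The context rule is what lets a short paragraph survive when it sits next to a long one under the same parent, while a short, isolated footer line is dropped. Because no per-site rule is learnt, the sketch is template-independent in the same sense as the method described above, though its thresholds would need tuning on real pages.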

Keywords/Search Tags:

Information extracting, Document object model, Analysis of web pages, Topics web crawler

Related items

1	Research On The Technology Of Incremental Web Pages Crawler
2	Study On Information Autonomous Extraction Technology Of Web Pages
3	Design And Implementation Of Distributed Web Crawler System Supporting Dynamic Web Pages Paring
4	Design And Implementation Of Crawler Technology For Topics
5	Research On Extracting Information From Chinese Web Pages Based On Conceptual Model
6	Design And Implementation Of HelloPaper:An Automatic System For Document Analysis
7	Analysis Method Of Targeted Information Based On Weibo Topics
8	Temporal Analysis of Topics in Time-Stamped Document Sets
9	Researches And Implement On Object Extracting Of Legacy Systems Written In Conventional Procedural Language
10	Research And Optimization Of Distributed Crawler System Based On Nutch