Design And Implementation Of Text Information Extracting Modules Of Html Web Pages Based On DOM

Posted on: 2012-10-07  Degree: Master  Type: Thesis
Country: China  Candidate: X L Su  Full Text: PDF
GTID: 2218330338453042  Subject: Software engineering
Abstract/Summary:
Extracting the main text from HTML web pages is a basic task, and still an open problem, for many Internet applications. The main text that a page expresses is usually surrounded by "noise". A typical page has two kinds of content. One is the main text of the page, e.g. the resume itself on a resume page. The other is content such as navigation bars, advertisements, and copyright notices, which has nothing to do with the main text; we call this "noise". A large amount of noise makes it hard for users to find the subject information quickly, so extracting the main text rapidly and accurately is one of the key technologies that determine the service quality of Internet applications.
A common way to extract the main text from HTML pages is inductive learning: extraction rules are learned from a given set of training pages. This method extracts text accurately, but the rules must be learned again whenever a website changes its template. As the number of templates grows, the extractor becomes increasingly expensive to maintain and less flexible.
The approach of this thesis is based on the Document Object Model (DOM) specification. The HTML source is parsed into a complete DOM tree, and a program then traverses that tree. Each node is judged by its subject relevance and by the context of the node. On this basis the content to extract is determined, irrelevant content is deleted, and the final output is only the main text of the HTML document.
Building on the extraction method of 'Content Extraction from Chinese Web Page Based on Statistics', this method adds a judgment of each node's context, which makes the extraction more accurate. The method also does not rely on the template information of web pages, so it is a general approach to text extraction. Finally, experimental results confirm the accuracy and effectiveness of the method.
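The abstract describes the DOM-based approach only in outline, so the following is a minimal Python sketch of the general idea rather than the thesis's implementation. It builds a small DOM-like tree with the standard-library `html.parser`, scores each leaf block, and uses sibling blocks as the "context" of a node. The relevance measure (text length and link density), the thresholds, the context rule, and all names (`Node`, `DomBuilder`, `extract_main_text`, `SAMPLE_HTML`) are assumptions made for illustration; the abstract does not specify them.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "blockquote"}
MIN_CHARS = 40        # assumed threshold, not from the thesis
MAX_LINK_RATIO = 0.3  # assumed threshold, not from the thesis

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # mixed list of Node and str, in document order

class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree; tolerant of unclosed tags."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def text_of(node):
    if isinstance(node, str):
        return node
    if node.tag in SKIP_TAGS:
        return ""
    return " ".join(t for t in map(text_of, node.children) if t)

def link_chars(node):
    if isinstance(node, str):
        return 0
    if node.tag == "a":
        return len(text_of(node))
    return sum(link_chars(c) for c in node.children)

def link_ratio(node):
    text = text_of(node)
    return link_chars(node) / len(text) if text else 1.0

def has_block_child(node):
    return any(isinstance(c, Node) and (c.tag in BLOCK_TAGS or has_block_child(c))
               for c in node.children)

def leaf_blocks(node, out):
    """Collect block elements that contain no further block elements."""
    for c in node.children:
        if isinstance(c, str) or c.tag in SKIP_TAGS:
            continue
        if c.tag in BLOCK_TAGS and not has_block_child(c):
            out.append(c)
        else:
            leaf_blocks(c, out)
    return out

def extract_main_text(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    blocks = leaf_blocks(builder.root, [])
    # Stand-in relevance test: long enough and not dominated by links.
    strong = [len(text_of(b)) >= MIN_CHARS and link_ratio(b) <= MAX_LINK_RATIO
              for b in blocks]
    kept = []
    for i, b in enumerate(blocks):
        # Context rule: a short, link-poor block is kept when an adjacent
        # block under the same parent is already judged to be content.
        near = [j for j in (i - 1, i + 1)
                if 0 <= j < len(blocks) and blocks[j].parent is b.parent]
        rescued = (bool(text_of(b)) and link_ratio(b) <= MAX_LINK_RATIO
                   and any(strong[j] for j in near))
        if strong[i] or rescued:
            kept.append(text_of(b))
    return "\n".join(kept)

SAMPLE_HTML = """<html><body>
<div id="nav"><ul><li><a href="/">Home</a></li><li><a href="/jobs">Jobs</a></li></ul></div>
<div id="main"><h1>Resume of Li Wei</h1>
<p>Li Wei is a software engineer with eight years of experience building web crawlers and search systems.</p>
<p>Contact by email.</p></div>
<div id="footer"><p>Copyright 2012 Example Corp. <a href="/terms">Terms</a></p></div>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

On the sample page, the navigation links and the copyright footer are dropped, while the short heading and the short closing paragraph are kept because a neighbouring sibling is a long, link-free paragraph. No template-specific rule is involved, which is the property the abstract emphasizes.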
Keywords/Search Tags: Information extraction, Document object model, Web page analysis, Topical web crawler