Research on Web Information Extraction Based on XML

Posted on:2011-01-23

Degree:Master

Type:Thesis

Country:China

Candidate:C X Fan

Full Text:PDF

GTID:2178330332969529

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, the volume of Web data is increasing sharply, making the Web the largest information source available. Consequently, how to extract valuable information from the Web has become a research focus. Currently, a large share of Web information is presented on information display pages, which serve as the main medium, so research on such pages is both significant and practical.

HTML is very successful at displaying data, but it focuses on the presentation of the data rather than its description, so the tags alone do not tell us what content they contain. XML is a newer technology that focuses on handling the data itself, and therefore offers great advantages for data extraction. XHTML provides a bridge between the two: an HTML page can be converted into XHTML that conforms to the XML specification.

Because HTML is used in the vast majority of Web pages, this thesis extracts data from information display pages by means of XML-related technologies. The solution has three steps. First, the target information display page is fetched and cleaned, and the cleaned HTML source is converted into a structured XHTML document with the Ntidy tool. Second, the main data block is extracted by assigning weights to the nodes of the DOM tree, and data records are generated. Finally, the most useful information is selected using an XML-based domain vocabulary and the number of words in each data record, and the best data record is stored.

This thesis studies the technologies related to information extraction. Based on the characteristics of information display pages, we propose an information extraction method and build an experimental model. During extraction, we choose a reasonable weight value for the main data block so that noise information is removed; we also apply a second round of weight recognition so that information is extracted accurately. The experiments show that this method achieves good results in recall and accuracy.
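
The second and third steps described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's weight formula, vocabulary, or record format, so everything specific here is an assumption made for illustration: the sample page, the `weight` function (words of text minus a penalty for words inside links), the `VOCAB` set, and the `record_score` ranking. The sketch also starts from XHTML that is already well formed, standing in for the output of the cleaning and conversion step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-cleaned XHTML. A real XHTML document declares the
# XHTML namespace, in which case tag names would carry a "{namespace}" prefix.
XHTML = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main">
  <p>Price: 35 yuan. Author: Lu Xun. Publisher: People's Literature.</p>
  <p>A short description of the book.</p>
</div>
<div id="footer"><a href="/about">About us</a></div>
</body></html>"""

# Hypothetical domain vocabulary (the thesis keeps its vocabulary in XML).
VOCAB = {"price", "author", "publisher"}


def words(node):
    """All whitespace-separated words in the text under a node."""
    return "".join(node.itertext()).split()


def weight(node):
    """Assumed node weight: total words minus twice the words inside links.

    Link-heavy nodes such as navigation bars get a low weight and are
    treated as noise.
    """
    link_words = sum(len(words(a)) for a in node.iter("a"))
    return len(words(node)) - 2 * link_words


def record_score(record):
    """Assumed ranking: vocabulary hits first, then word count."""
    cleaned = [w.strip(".:,").lower() for w in words(record)]
    hits = sum(w in VOCAB for w in cleaned)
    return (hits, len(cleaned))


root = ET.fromstring(XHTML)

# Step 2: the node with the highest weight is taken as the main data block,
# and each of its children is treated as one data record.
main_block = max(root.iter(), key=weight)
records = list(main_block)

# Step 3: keep the record that best matches the vocabulary.
best = max(records, key=record_score)

print(main_block.get("id"))  # main
print("".join(best.itertext()))
```

In this sample the navigation and footer blocks receive negative weights because all of their text sits inside links, so the `main` block is selected even though the enclosing `body` contains more words overall.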

Keywords/Search Tags:

Web Information Extraction, XML, Information Display Page, Weight Coefficient

Related items

1	Research On WEB Entity Information Extraction Algorithm And Its Application
2	The Research And Implementation Of Commodity Information Extraction And Fusion Based On Web
3	Research On Web Article Automatic Extraction Method Based On Page Segmentation
4	Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website
5	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
6	Research And Implementation Of Web Page Display For Handhold Intelligent Terminal Based On Information Extraction
7	Research On Specialty Knowledge Retrieval Method Based On Web Information Extraction
8	A Study On Methods Of Web Page Topical Information Extraction
9	Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree
10	Research On Web Information Extraction Based On Page Structure Clustering