Research on Web Information Extraction Based on XML

Posted on: 2011-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: C X Fan
GTID: 2178330332969529
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the volume of data on the Web is growing sharply, and the Web has become the largest information source of all. Consequently, how to extract valuable information from the Web has become a research focus. Currently, a large share of Web information is presented on information display pages, which are the main medium, so research on such pages is both significant and practical.

HTML is very successful at displaying data, but it focuses on how the data is presented rather than on describing the data, so the tags alone do not tell us what content they contain. XML is a newer technology that focuses on describing and manipulating data, and it therefore has great advantages for data extraction. XHTML bridges the two: HTML can be converted to XHTML, which conforms to the XML specification.

Because the vast majority of Web pages are written in HTML, this thesis uses XML-related technologies to extract data from information display pages. The solution has three steps. First, the target information display page is fetched and cleaned, and the cleaned HTML source is converted into a structured XHTML document with the Ntidy tool. Second, the main data block is extracted by assigning weights to the DOM tree nodes, and data records are generated from it. Finally, the most useful information is selected using an XML-based domain vocabulary and the number of words in each data record, and the best data records are stored.

This thesis studies the technologies related to information extraction. Based on the features of information display pages, we propose an information extraction method and build an experimental model. During extraction, we choose a reasonable weight for the main data block so that noise information is filtered out; we also apply a second weight-recognition pass to extract the information more precisely.
Experiments show that this method achieves good recall and accuracy.
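
The extraction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the implementation used in the thesis. The sample XHTML, the weight formula (text length minus a penalty for link text), the candidate container tags, the vocabulary, and all function names are assumptions made for the example. The input is assumed to be already cleaned into well-formed XHTML, which the thesis does with Ntidy.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML, standing in for the output of the HTML cleaning step.
# The XHTML namespace is omitted here to keep tag names short.
XHTML = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content">
  <p>Name: Example laptop</p>
  <p>Price: 499 USD</p>
  <p>Contact us</p>
</div>
<div id="footer"><a href="/about">About</a></div>
</body></html>"""

# Hypothetical domain vocabulary; the thesis stores its vocabulary in XML.
FIELD_VOCABULARY = {"name", "price", "brand"}


def text_len(node):
    """Total length of the text chunks under node, each stripped of
    leading and trailing whitespace."""
    return sum(len(t.strip()) for t in node.itertext())


def link_text_len(node):
    """Length of the text that sits inside <a> elements under node."""
    return sum(text_len(a) for a in node.iter("a"))


def block_weight(node, link_penalty=2.0):
    """Assumed weight: text volume minus a penalty for link text,
    so that navigation bars and footers score low."""
    return text_len(node) - link_penalty * link_text_len(node)


def main_data_block(root, candidates=("div", "table", "ul")):
    """Return the candidate container with the highest weight."""
    blocks = [n for n in root.iter() if n.tag in candidates]
    return max(blocks, key=block_weight)


def record_score(text, vocabulary=FIELD_VOCABULARY):
    """Return (vocabulary hits, word count) for one data record."""
    words = [w.strip(":,.").lower() for w in text.split()]
    hits = sum(1 for w in words if w in vocabulary)
    return hits, len(words)


def extract_records(block, min_hits=1, min_words=2):
    """Keep the child records that mention a vocabulary word and
    contain at least min_words words."""
    records = []
    for child in block:
        text = " ".join("".join(child.itertext()).split())
        hits, n_words = record_score(text)
        if hits >= min_hits and n_words >= min_words:
            records.append(text)
    return records


if __name__ == "__main__":
    root = ET.fromstring(XHTML)
    block = main_data_block(root)
    for record in extract_records(block):
        print(record)
```

Two simplifications are worth noting. Real XHTML declares the namespace `http://www.w3.org/1999/xhtml`, so with `xml.etree.ElementTree` the tag names would need to be namespace-qualified. The sketch also treats every container independently, so a wrapper element that encloses the whole page would outscore its children; a fuller weighting scheme would have to account for nesting.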
Keywords/Search Tags:Web Information Extraction, XML, Information Display Page, Weight Coefficient