Research On Domain-oriented Deep Web Information Extraction

Posted on: 2014-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y Gao
GTID: 2268330401970490
Subject: Computer application technology
Abstract/Summary:
The Deep Web contains richer and more specialized data resources than the Surface Web. With the rapid growth of information, research on the Deep Web has attracted increasing attention. Deep Web data are semi-structured, and how to extract this information and attach semantics to it has become a widely studied research topic.

To meet the application requirements of Deep Web information extraction, this thesis applies several developing technologies, including Chinese word segmentation, ontology modeling, and machine learning, to study web page preprocessing, domain ontology construction, template construction, and template matching. It then carries out Deep Web information extraction experiments in the weather and book domains. The main research work includes:

(1) Research on web page preprocessing. This part studies how to represent an HTML document as a hierarchical tree of DIV block elements, attributes, and text, together with the other preprocessing steps: converting each DIV block to a string stream, Chinese word segmentation, and word frequency statistics. The goal is to turn the HTML document into a data set that uses the DIV block as its basic unit and contains the segmentation results.

(2) Research on domain ontology construction. As a semantic foundation for communication between different subjects, the domain ontology optimizes template construction by reducing the unrelated content that would otherwise appear in the template.

(3) Research on dual template construction. HTML pages usually use "DIV+CSS" for the overall structure and table layout for the details, so this thesis combines a DIV block template with a table template. Using the results of web page preprocessing, a classifier is trained with the C4.5 decision tree algorithm to select the number of DIV blocks to extract, which yields the DIV block template that locates the data area. Then, using XML technology, an XSLT document is constructed with machine assistance to form the table template that extracts the data fragments. Experimental results show that the classifier trained with the C4.5 decision tree reaches an accuracy of 95.2%, which makes the DIV block judgment reliable. Extraction with the dual template achieves average accuracy and recall above 95%. The dual template is also more stable than a single template and easier to maintain.

(4) Research on template matching. The traditional method, regular-expression string matching on the URL, produces large errors. Building on that method, this thesis adds a web page similarity calculation to template matching: URL string matching first makes a coarse match, and web page similarity then makes the precise match. Experimental results show that, with efficiency maintained, template matching accuracy reaches 93%, an increase of 32.9% over the traditional method.
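The preprocessing described in item (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the class `DivBlockCollector`, the functions `tokenize` and `preprocess`, and the output layout are hypothetical names chosen for illustration, and the regular-expression tokenizer is only a stand-in for a real Chinese word segmenter.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class DivBlockCollector(HTMLParser):
    """Collect each DIV element as one block of attributes plus text.

    Text is assigned to the innermost open DIV, so nested DIVs
    become separate blocks. Blocks are stored in closing order.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks: {"attrs": dict, "text": str}
        self._stack = []   # currently open DIVs: (attrs, [text pieces])

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append((dict(attrs), []))

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            attrs, pieces = self._stack.pop()
            self.blocks.append({"attrs": attrs, "text": " ".join(pieces)})

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self._stack[-1][1].append(text)


def tokenize(text):
    # Assumption: placeholder tokenizer. A real pipeline for Chinese pages
    # would call a word segmenter here instead of splitting on word runs.
    return re.findall(r"\w+", text.lower())


def preprocess(html):
    """Return one record per DIV block: attributes, text, term frequencies."""
    parser = DivBlockCollector()
    parser.feed(html)
    parser.close()
    return [
        {**block, "term_freq": Counter(tokenize(block["text"]))}
        for block in parser.blocks
    ]
```

Each record keeps the DIV block as the basic unit, which is the shape a downstream block classifier would need. The choice of features for that classifier is not given in the abstract and is left out here.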
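The two-stage template matching in item (4) can be sketched as follows. The abstract does not say which page similarity measure the thesis uses, so this sketch assumes Jaccard similarity over sets of tag paths. The template fields (`name`, `url_pattern`, `sample_html`), the function names, and the 0.6 threshold are all hypothetical.

```python
import re
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collect the set of tag paths (e.g. "div/table/tr") in a page."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.paths = set()
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            # Void elements have no end tag, so record them without pushing.
            self.paths.add("/".join(self._stack + [tag]))
            return
        self._stack.append(tag)
        self.paths.add("/".join(self._stack))

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_template(url, html, templates, threshold=0.6):
    """Stage 1: coarse filter by URL pattern. Stage 2: rank by page similarity.

    Returns (template name, score), or (None, 0.0) if nothing qualifies.
    """
    page_paths = tag_paths(html)
    best_name, best_score = None, 0.0
    for template in templates:
        if not re.search(template["url_pattern"], url):
            continue  # rejected by the coarse URL match
        score = jaccard(page_paths, tag_paths(template["sample_html"]))
        if score >= threshold and score > best_score:
            best_name, best_score = template["name"], score
    return best_name, best_score
```

The URL filter keeps the candidate set small, so the more expensive structural comparison runs only on templates that could plausibly apply. When several templates share a URL pattern, the similarity score separates them.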
Keywords/Search Tags:Deep Web, domain ontology, DIV blocks template, table template, template matching