Research On Domain-oriented Deep Web Information Extraction

Posted on:2014-06-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y Gao

Full Text:PDF

GTID:2268330401970490

Subject:Computer application technology

Abstract/Summary:

The Deep Web contains richer and more specialized data resources than the Surface Web. With the rapid growth of information, research on the Deep Web has attracted increasing attention. Deep Web data are semi-structured, and how to extract this information and attach semantic information to it has become one of the widely studied topics in Deep Web research.

To meet the application requirements of Deep Web information extraction, this paper applies several developing technologies, such as Chinese word segmentation, ontology modeling, and machine learning, to study web page preprocessing, domain ontology construction, template construction, and template matching, and carries out Deep Web information extraction experiments in the weather and book domains. The main research work of this paper includes:

(1) Research on web page preprocessing. This part studies how to represent an HTML document as a hierarchical tree of DIV block elements, attributes, and text, together with the other preprocessing steps: converting each DIV block to a string stream, Chinese word segmentation, and word frequency statistics. The goal of these steps is to turn the HTML document into a data set that uses the DIV block as its basic unit and contains the segmentation results.

(2) Research on domain ontology construction. As a semantic foundation for communication between different subjects, the domain ontology optimizes template construction by reducing the unrelated content that would otherwise appear in the template.

(3) Research on dual template construction. HTML pages usually use "DIV+CSS" for the overall structure and table layout for the details, so this paper combines a DIV block template with a table template. Using the results of web page preprocessing, the C4.5 decision tree algorithm trains a classifier that selects the number of DIV blocks to extract, which yields the DIV block template that locates the data area. An XSLT document is then constructed with machine assistance, using XML technology, to form the table template that extracts the data fragments. Experimental results show that the classifier trained with the C4.5 decision tree reaches an accuracy of 95.2%, which supports reliable DIV block judgment. Extraction with the dual template achieves average accuracy and recall above 95%, a better extraction result. The dual template is also more stable than a single template and easier to maintain.

(4) Research on template matching. The traditional method, regular expression string matching on the URL, produces large errors. This paper builds on the traditional method by adding web page similarity calculation to template matching: URL string matching first gives a coarse match, and web page similarity matching then gives a precise match. Experimental results show that, while maintaining efficiency, template matching accuracy reaches 93%, an increase of 32.9% over the traditional method.
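The two-stage template matching described in item (4) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not say which web page similarity measure is used, so the sketch assumes Jaccard similarity over sets of tag paths. The names `match_template`, `tag_paths`, `TEMPLATES`, `PAGE`, the example URLs, and the 0.8 threshold are all hypothetical.

```python
import re
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collect the set of root-to-element tag paths in an HTML document."""

    VOID = {"br", "img", "input", "hr", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:  # void elements never get a closing tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def jaccard(a, b):
    """Assumed similarity measure: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_template(url, html, templates, threshold=0.8):
    """Stage 1: coarse filter by URL pattern. Stage 2: pick by page similarity."""
    page_paths = tag_paths(html)
    best_name, best_score = None, threshold
    for template in templates:
        if not re.search(template["url_pattern"], url):
            continue  # rejected by the coarse URL match
        score = jaccard(page_paths, tag_paths(template["sample_html"]))
        if score >= best_score:
            best_name, best_score = template["name"], score
    return best_name


# Hypothetical templates that share one URL pattern, so the URL alone is ambiguous.
TEMPLATES = [
    {"name": "book_detail", "url_pattern": r"/book/\d+",
     "sample_html": "<html><body><div><table><tr><td>x</td></tr></table></div></body></html>"},
    {"name": "book_list", "url_pattern": r"/book/\d+",
     "sample_html": "<html><body><ul><li>x</li></ul></body></html>"},
]
PAGE = ("<html><body><div><table><tr><td>Title</td></tr>"
        "<tr><td>Author</td></tr></table></div></body></html>")

if __name__ == "__main__":
    print(match_template("http://example.com/book/42", PAGE, TEMPLATES))
```

In this sketch the URL pattern only narrows the candidate set, and the structural similarity decides between templates whose URL patterns overlap, which is the kind of case where URL matching alone would pick the wrong template.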

Keywords/Search Tags:

Deep Web, domain ontology, DIV blocks template, table template, template matching

Related items

1	The Research On Moving Target Detection Based On Template Matching
2	Pretreatment Of Banknote Testing And Improved Method Of Template Matching
3	Research On Text Orientation Based On Template Matching
4	The Design And Implementation Of Template Matching Mechanisms In TTCN-3 Testing Platform
5	Study On Template Matching Algorithms Based On Gray Value
6	Research And Application Of Template Matching Algorithm In Image Stitching
7	Research On Image Detection Method For Text Defects In Printed Matter Based On Corner And Block Template Matching
8	Research And Improvement Of Microscopic Cell Image Segmentation Algorithm Based On Template Matching
9	Study On Technique Of Chinese Template Based On PHP
10	Template Recognition And Extraction Of Complex Table Document Images