The Research Of Semi-structured Web Pages Information Extraction

Posted on:2012-08-01

Degree:Master

Type:Thesis

Country:China

Candidate:M L Zhu

Full Text:PDF

GTID:2178330338493796

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, the Web has become an important platform for publishing and accessing information. The number of resources served by dynamically generated Web sites backed by underlying databases, such as product listing pages, keeps growing. These pages are generally reached through a particular query interface and are difficult for other applications to use directly, so extracting information from them has become very important. This thesis studies how to extract information from such semi-structured pages.

The thesis introduces the concept of Web information extraction, its main problems and related technologies, and analyzes existing Web information extraction algorithms together with their advantages and disadvantages. Because existing methods cannot locate the main data region very accurately, three heuristics are combined to find it: highest fanout, largest size increase, and largest tag count. When segmenting data records, all existing methods need to compute the similarity of subtrees, and they are not very efficient. To address this, a tree edit distance clustering algorithm is proposed; the clustering reduces the number of subtree comparisons and thereby improves efficiency. At the same time, using the tree edit distance to represent subtree similarity gives the algorithm higher accuracy. Clustering yields a list of candidate segmentations, and a formula is given for choosing the best segmentation scheme. A master alignment algorithm is then used to extract the attributes of the data records.

Experiments show that the method has a higher degree of automation and higher efficiency, and tests on real Web pages show that it achieves higher accuracy.
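The abstract does not give the clustering procedure itself, so the following is only a minimal sketch of how a tree edit distance could drive the grouping of similar subtrees. Everything in it is an assumption made for illustration: the `(tag, children)` tuple representation, the helper names, the 0.3 threshold, the use of a Selkow-style top-down edit distance with unit costs (a simpler variant than a general tree edit distance), and the greedy "compare with one representative per cluster" strategy, which is one plausible way to avoid comparing every pair of subtrees.

```python
from functools import lru_cache


def node(tag, *children):
    """Build a hashable tree node (hypothetical representation)."""
    return (tag, tuple(children))


def size(tree):
    """Number of nodes in the tree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Selkow-style top-down edit distance with unit costs.

    Relabeling a node costs 1; inserting or deleting a whole
    subtree costs its number of nodes.
    """
    (tag_a, kids_a), (tag_b, kids_b) = a, b
    relabel = 0 if tag_a == tag_b else 1
    m, n = len(kids_a), len(kids_b)
    # dp[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),  # delete a subtree
                dp[i][j - 1] + size(kids_b[j - 1]),  # insert a subtree
                dp[i - 1][j - 1]
                + edit_distance(kids_a[i - 1], kids_b[j - 1]),  # match
            )
    return relabel + dp[m][n]


def normalized_distance(a, b):
    """Edit distance scaled by the combined size of both trees."""
    return edit_distance(a, b) / (size(a) + size(b))


def cluster(subtrees, threshold=0.3):
    """Greedy clustering: compare each subtree only with the first
    member (the representative) of every existing cluster."""
    clusters = []  # each cluster is a list of indices into subtrees
    for idx, tree in enumerate(subtrees):
        for members in clusters:
            representative = subtrees[members[0]]
            if normalized_distance(representative, tree) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Two product-like list items and one unrelated block (made-up data).
item1 = node("li", node("a", node("img")), node("span"), node("b"))
item2 = node("li", node("a", node("img")), node("span"))
ad = node("div", node("script"))

groups = cluster([item1, item2, ad])
print(groups)
```

With this toy input the two `li` subtrees differ by a single deleted leaf and fall into one cluster, while the `div` block stays separate. With n subtrees and k clusters, the greedy pass performs at most n·k distance computations instead of the n·(n−1)/2 needed for all pairs, which is the kind of saving the abstract attributes to clustering.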

Keywords/Search Tags:

Web Page Information Extraction, data records, tree edit distance, cluster algorithm

Related items

1	Web Information Extracting Based On Tree Edit Distance
2	Research On Automatic Web Information Extraction Technique
3	Study On ETL Technology Based On XML Data Resources
4	Storage Optimization And Tree Vertical Merging Algorithm Of Tai Tree Editing Distance Algorithm
5	Table Information Extraction Based On Web Structure
6	Algorithms Based On Visual Similarity Of The Research In Information Extraction And Implementation
7	Research On Deep Web Information Extraction Technology
8	Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree
9	Research On Mining Structure Of WEB Page For Information Extraction
10	Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website