Research On Web Information Extraction Based On Clustering Algorithm

Posted on:2012-04-29

Degree:Master

Type:Thesis

Country:China

Candidate:T F Qiu

Full Text:PDF

GTID:2178330335963926

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the Web has become a large and complex information resource. At present, Web data comes in the form of HTML pages, most of which are presented in structured or semi-structured form and are dynamically generated. One typical example is the commodity website. There are a huge number of such websites, and they offer abundant information, but most of the data they provide cannot be directly used or analyzed by application software. To promote the effective use of Web data, researchers began to study Web information extraction technology. This technology makes Web data usable by software and further advances the development of the Internet.

Based on the characteristics of dynamic websites, this thesis designs an information extraction system based on the DOM structure of pages. The system accurately clusters web pages, generates a wrapper, effectively extracts the data from the pages, and saves them as structured data.

The thesis first studies the structure of dynamic web pages and calculates the similarity among pages based on tree edit distance. It then uses a hierarchical clustering algorithm to cluster the pages. The accuracy of the clustering result is improved by setting a global self-similarity threshold and a column similarity threshold during page clustering. For the wrapper, improving the data recording method and optimizing the data record pattern reduces the computational cost of pattern matching and improves the efficiency of the information extraction system. The semantic annotation of data nodes is highly adaptable: it annotates nodes according to their characteristics and distinguishes different types of pages. Finally, the system achieves the goal of automatically extracting data from web pages. Theoretical analysis and experimental results indicate that the method can effectively extract data from structured websites.
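
The page-clustering step described in the abstract (tree-based page similarity followed by hierarchical clustering with a threshold) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes a simplified top-down (Selkow-style) tree edit distance in place of whatever edit-distance variant the thesis uses, average-linkage merging, and a single hypothetical `threshold` parameter instead of the global self-similarity and column similarity thresholds. The names `node`, `similarity`, and `cluster_pages`, and the toy pages, are invented for illustration.

```python
from functools import lru_cache
from itertools import combinations

# Assumption: a DOM tree is modelled as a nested tuple (tag, (children...)).
def node(tag, *children):
    return (tag, tuple(children))

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])

@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Simplified top-down tree edit distance (not a general tree edit distance).

    Relabelling a node costs 1; inserting or deleting a subtree costs its size.
    """
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    rows, cols = len(ca) + 1, len(cb) + 1
    # Sequence edit distance over the two child lists.
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),   # delete subtree of a
                d[i][j - 1] + size(cb[j - 1]),   # insert subtree of b
                d[i - 1][j - 1] + top_down_distance(ca[i - 1], cb[j - 1]),
            )
    return relabel + d[-1][-1]

def similarity(a, b):
    """Map distance to a score in (0, 1]; 1.0 means identical structure."""
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))

def cluster_pages(pages, threshold):
    """Average-linkage agglomerative clustering of page indices.

    Merging stops once no pair of clusters has average similarity >= threshold.
    """
    clusters = [[i] for i in range(len(pages))]

    def link(x, y):
        total = sum(similarity(pages[i], pages[j]) for i in x for j in y)
        return total / (len(x) * len(y))

    while len(clusters) > 1:
        (p, q), best = max(
            (((p, q), link(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[p] = clusters[p] + clusters[q]  # p < q, so index p stays valid
        del clusters[q]
    return clusters

# Toy pages: two list-style pages and one detail-style page (hypothetical).
list_page_1 = node("html", node("body", node("ul", node("li"), node("li"), node("li"))))
list_page_2 = node("html", node("body", node("ul", node("li"), node("li"))))
detail_page = node("html", node("body", node("h1"),
                                node("table", node("tr", node("td"), node("td")))))

pages = [list_page_1, list_page_2, detail_page]
clusters = cluster_pages(pages, threshold=0.8)
print(clusters)
```

With these toy pages the two list pages differ by a single `li` node and end up in one cluster, while the detail page stays on its own. The threshold value 0.8 is arbitrary; a real system would tune it on sample pages.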

Keywords/Search Tags:

information extraction, page clustering, pattern tree optimization, semantic annotation

Related items

1	Research On Key Techniques For AIE-based Semiautomatic Annotation Of Web Page
2	Study On Data Extraction And Semantic Annotation For Specific Field Deep Web
3	Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website
4	Research On Mining Structure Of WEB Page For Information Extraction
5	Key Techniques On Deep Web Data Extraction
6	Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree
7	Research On User Feature Extraction And Behavior Pattern In Location-based Social Network
8	A Study On Feature Design Algorithms With Application To Image Annotation And Information Extraction
9	The Research Of The Emerging Technology Weak Signal Recognition Based On Patent
10	Research On Web Information Extraction Based On Page Structure Clustering