
Research On Web Information Extraction Based On Clustering Algorithm

Posted on: 2012-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: T F Qiu
Full Text: PDF
GTID: 2178330335963926
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become a large and complex information resource. At present, Web data comes in the form of HTML pages, most of which are dynamically created and presented in structured or semi-structured form. A typical example is the commodity website: there are a huge number of such sites and they offer abundant information, but most of the data they provide cannot be used or analyzed directly by application software. To promote the effective use of Web data, researchers began to study Web information extraction technology, which makes Web data usable by software and supports the further development of the Internet.

Based on the characteristics of dynamic websites, this thesis designs an information extraction system based on the DOM structure of pages. The system accurately clusters web pages, generates a wrapper, effectively extracts the data from the pages, and saves it as structured data.

The thesis first studies the structure of dynamic web pages and calculates the similarity between pages based on tree edit distance. A hierarchical clustering algorithm is then used to cluster the pages, and the accuracy of the clustering result is improved by setting a global self-similarity threshold and a column similarity threshold during page clustering. For the wrapper, improving the method of recording data and optimizing the data record pattern reduces the computational cost of pattern matching and improves the efficiency of the information extraction system. The semantic annotation of data nodes is highly adaptable: it annotates nodes according to their characteristics and distinguishes different types of pages. Finally, the system achieves the goal of automatically extracting data from web pages. Theoretical analysis and experimental results indicate that the method can effectively extract data from structured websites.
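The page clustering step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation and rests on several assumptions: DOM trees are reduced to nested `(tag, children)` tuples, the tree edit distance is a simplified top-down (Selkow-style) variant in which only whole subtrees are inserted or deleted, clusters are merged with average linkage, and a single hypothetical `threshold` parameter stands in for the global self-similarity and column similarity thresholds, whose definitions the abstract does not give.

```python
# Sketch only: all names and parameters here are illustrative assumptions,
# not taken from the thesis.

def size(tree):
    """Number of nodes in a (tag, children) tree."""
    _tag, children = tree
    return 1 + sum(size(child) for child in children)

def tree_distance(a, b):
    """Simplified top-down tree edit distance (Selkow-style).

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs its node count. Child sequences are aligned by dynamic programming.
    """
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),                      # delete subtree
                d[i][j - 1] + size(cb[j - 1]),                      # insert subtree
                d[i - 1][j - 1] + tree_distance(ca[i - 1], cb[j - 1]),  # match
            )
    return relabel + d[m][n]

def similarity(a, b):
    """Map the distance to a score in (0, 1]; 1.0 means identical trees."""
    return 1 - tree_distance(a, b) / (size(a) + size(b))

def cluster(pages, threshold=0.8):
    """Agglomerative clustering with average linkage.

    Merging stops once the most similar pair of clusters falls below
    `threshold`. Returns clusters as lists of page indices.
    """
    sim = [[similarity(p, q) for q in pages] for p in pages]
    clusters = [[i] for i in range(len(pages))]

    def link(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (x, y) = max(
            (link(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Two list-style pages and one detail-style page (made-up structures).
list_a = ("html", [("body", [("ul", [("li", []), ("li", [])])])])
list_b = ("html", [("body", [("ul", [("li", []), ("li", []), ("li", [])])])])
detail = ("html", [("body", [("h1", []),
                             ("table", [("tr", [("td", []), ("td", [])])]),
                             ("p", [])])])

print(cluster([list_a, list_b, detail], threshold=0.8))  # [[0, 1], [2]]
```

In this toy input the two list pages differ by a single `li` node and are merged, while the detail page stays in its own cluster. A general ordered tree edit distance (for example Zhang-Shasha) would allow edits at any depth, at a higher computational cost.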
Keywords/Search Tags: information extraction, page clustering, pattern tree optimization, semantic annotation