Research On DOM Based Intelligent Web Information Extraction Technology

Posted on: 2010-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: J T Qu
Full Text: PDF
GTID: 2178360275485721
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become a huge, distributed, and shared information resource. Most Web data is currently published as HTML pages. Because data described in HTML is semi-structured, web pages are suitable mainly for human browsing, and applications cannot directly parse them and make use of the rich information they contain. An important class of Web data is delivered through dynamic web pages, such as portal news sites and e-commerce websites. Pages of this kind contain little free text, are highly structured, and often carry rich content, so extracting information from them is valuable. Using programs to extract useful information from the mass of web pages quickly, and thereby to make information extraction more efficient for people, has become increasingly important. Web information extraction technology was proposed to enhance the usability of Web data and to support more value-added services. By wrapping existing Web information sources, it extracts structured information from web pages, which makes it possible for applications to use Web data. The technology therefore has broad prospects and is one of the active research fields in data mining.

This thesis proposes an intelligent information extraction system based on the DOM model. The system performs automatic analysis of web page text, feature extraction and selection, text classification, and region segmentation and reconstruction of pages, in order to extract useful information, store it in structured form in a database, and make it available to any specific information query application.

The thesis first introduces the research and development of information extraction technology and compares several typical Web information extraction systems. It then introduces DOM model theory together with programming practice, and text classification.
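As a rough illustration of what DOM-oriented analysis of web page text can look like, the following minimal sketch walks an HTML document with Python's standard `html.parser` module and records each visible text fragment together with the tag path that leads to it. It is not the thesis's implementation: the class name `TextBlockExtractor`, the helper `extract_blocks`, and the choice of skipped and void tags are assumptions made for this example only.

```python
from html.parser import HTMLParser


class TextBlockExtractor(HTMLParser):
    """Collect visible text fragments with the tag path leading to each one.

    Hypothetical helper for illustration; not taken from the thesis.
    """

    SKIP = {"script", "style"}  # subtrees whose text is not page content
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.blocks = []  # list of (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.SKIP & set(self.path)):
            self.blocks.append(("/".join(self.path), text))


def extract_blocks(html):
    """Return (tag_path, text) pairs for every visible text node in html."""
    parser = TextBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    page = "<html><body><div><h1>Title</h1><p>Price: <b>10</b></p></div></body></html>"
    for path, text in extract_blocks(page):
        print(path, "->", text)
```

Keeping the tag path next to each fragment is one simple way to preserve the structural context that later steps, such as classification or page segmentation, could draw on.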
The next chapter elaborates the structure, design method, and workflow of the main page information extraction system. First, a text preprocessing solution based on a DOM parser is discussed. For feature selection, the information gain (IG) value is used as a feature weighting function to weight and select the features of the HTML text. In the chapter on automatic text categorization, the KNN-SVM algorithm is used to categorize texts. The method of page segmentation with a mapping table is then analyzed, followed by page rebuilding according to the relations between segments.

Finally, a prototype of the DOM-based intelligent web information extraction system is presented, and it produced good results. In a series of extraction experiments on dynamic web pages, and in a comparison of the results with several other information extraction algorithms, the web information extraction method proposed in this thesis achieves high extraction precision.
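The abstract does not give the exact IG formula used in the thesis. Assuming the standard information gain definition from text categorization, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], a term's weight can be sketched as follows. The function names and the toy documents are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of `term` with respect to the class labels.

    docs   -- list of token sets, one per document (assumed representation)
    labels -- list of class labels, aligned with docs
    """
    n = len(labels)
    if n == 0:
        return 0.0
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Toy data, invented for this example.
    docs = [{"price", "buy"}, {"price", "cart"}, {"news", "report"}, {"news", "buy"}]
    labels = ["shop", "shop", "news", "news"]
    for term in ("price", "buy"):
        print(term, information_gain(docs, labels, term))
```

In this toy data, "price" separates the two classes perfectly and so receives the maximum gain of one bit, while "buy" appears equally often in both classes and receives zero. Terms can then be ranked by this weight and only the highest-scoring ones kept as features.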
Keywords/Search Tags: DOM, Information Extraction, Text Categorization, Feature Extraction