
Research On Techniques Of Automatic Data Record Analysis And Recognition For Accurate Web Information Extraction

Posted on: 2012-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: F L Quan
Full Text: PDF
GTID: 2178330335963013
Subject: Computer software and theory
Abstract/Summary:
The rapid development of information technology has made the Web an important platform for publishing and sharing information worldwide. Internet applications such as e-commerce, microblogs, social networks, and group purchasing are emerging alongside the explosive growth of websites and available web information, and a massive amount of valuable information is embedded in this sea of Web data. To offer better information services, more and more applications therefore need to extract accurate and useful information for further processing.

The main challenge is to raise the degree of automation, relieving users of the burden of writing extraction rules, while still ensuring the precision of data extraction. Most existing research fails to achieve a good tradeoff between extraction precision and user effort. To address this, this thesis proposes an integrated approach and model for accurate web information extraction and, on that basis, studies techniques for automatic analysis of data record structure. The major contributions are as follows:

1) To accomplish accurate Web information extraction, we propose an integrated approach and model based on automatic structure analysis and interactive rule generation. The model combines automatic structure analysis with semi-automatic web information extraction based on user interaction. During data analysis, the automatic structure analysis technique is selected, according to the features of the page, to process data-intensive deep web pages; the semi-automatic method is adopted in other cases. The model thus maintains extraction precision while reducing the user's operational burden.

2) To analyze data records, we build a feature system based on the elements of the HTML document and the nodes of the DOM tree, derived from a study of the characteristics of HTML and the DOM. The features fall into two categories: basic features, which describe the information of a node in the DOM tree, and classified features, which capture the different impact that HTML elements have on data structure analysis. This feature system provides the framework for the weighted tree matching algorithm based on feature distribution and for the feature-based layered filtering strategy.

3) We propose a weighted tree matching algorithm based on feature distribution. After analyzing the shortcomings of the simple tree matching algorithm, we assign each DOM tree node a weight based on the features it contains. Nodes are thereby differentiated by significance, which improves analysis performance.

4) We propose a feature-based layered filtering strategy. By analyzing the characteristics of HTML elements and how strongly each correlates with the structural semantics of a data record, the strategy distinguishes structure elements from attribute elements. Structure elements, which are more closely related to the structural semantics of a data record, are used first for record structure analysis; attribute elements, and even the information inside nodes, are considered only when the structure elements are not sufficient to obtain satisfactory records.

5) We propose a data record recognition and filtering algorithm to extract the valid data block. The algorithm adopts the weighted tree matching algorithm and the layered filtering strategy to better measure the similarity of DOM trees, and then recognizes several candidate data blocks. Visual characteristics are used to filter out invalid results.

6) Building on data record recognition, we propose a field analysis algorithm based on the visual features, DOM tree features, and content features of a web page. To recognize data fields, the algorithm checks whether each node x in the DOM tree can be the starting node of a data field, taking into account visual features and the distribution features of DOM tree nodes. Content features are then used to further correct the recognized data fields.

Finally, this thesis reports experiments on multiple websites to validate the proposed algorithms and analyzes the experimental results in detail. The analysis indicates that the proposed algorithms can significantly improve record analysis performance.
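
The difference between simple and weighted tree matching described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual weighting scheme, so everything specific here is an assumption made for illustration: the `Node` class, the `STRUCTURE_TAGS` and `FORMAT_TAGS` sets, and the weights 2.0, 0.5, and 1.0 are hypothetical. The recursion is the classic simple tree matching dynamic program, with the constant 1 per matched node replaced by a pluggable weight function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Minimal stand-in for a DOM element: a tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical feature classes; the thesis's real feature system is not
# specified in the abstract.
STRUCTURE_TAGS = {"table", "tr", "td", "ul", "ol", "li", "div"}
FORMAT_TAGS = {"b", "i", "em", "strong", "font", "span"}


def unit_weight(node: Node) -> float:
    """Simple tree matching: every matched node counts as 1."""
    return 1.0


def feature_weight(node: Node) -> float:
    """Assumed weighting: structural tags count more, formatting tags less."""
    if node.tag in STRUCTURE_TAGS:
        return 2.0
    if node.tag in FORMAT_TAGS:
        return 0.5
    return 1.0


def tree_match(a: Node, b: Node,
               weight: Callable[[Node], float] = unit_weight) -> float:
    """Weight of the best order-preserving matching between trees a and b."""
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching of the first i children of a
    # against the first j children of b.
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + tree_match(
                a.children[i - 1], b.children[j - 1], weight)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return best[m][n] + weight(a)


def total_weight(node: Node,
                 weight: Callable[[Node], float] = unit_weight) -> float:
    """Sum of weights over all nodes in the subtree."""
    return weight(node) + sum(total_weight(c, weight) for c in node.children)


def similarity(a: Node, b: Node,
               weight: Callable[[Node], float] = unit_weight) -> float:
    """Matched weight normalized by the average total weight of both trees."""
    average = (total_weight(a, weight) + total_weight(b, weight)) / 2
    return tree_match(a, b, weight) / average


# Two candidate records that differ only in one formatting tag.
r1 = Node("li", [Node("div", [Node("span"), Node("b")])])
r2 = Node("li", [Node("div", [Node("span"), Node("i")])])

print(similarity(r1, r2))                  # unweighted
print(similarity(r1, r2, feature_weight))  # with the assumed feature weights
```

With unit weights the two records score 0.75; with the assumed feature weights they score 0.9, because the mismatched `b` and `i` nodes carry less weight than the matching `li` and `div` structure. This is only meant to show how node weights shift the similarity measure, not to reproduce the thesis's algorithm.
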
Keywords/Search Tags: Accurate Web Information Extraction, Data Structure Analysis, Simple Tree Matching Algorithm, Weighted Tree Matching Algorithm, Data Record Recognition, Data Field Recognition