
Research On Web-based Full-station Data Information Extraction Based On Template

Posted on: 2018-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: T F Li
GTID: 2358330536488535
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet and the mobile Internet has led to explosive growth in global data, and the Web has become a vast source of information with great potential value. Besides the body text and related information that users care about, Web pages also contain a large amount of noise unrelated to the page topic, such as navigation, advertising, and copyright notices. Faced with massive and complex Web information, how to obtain the needed information quickly and efficiently for further mining, and so uncover more of its potential value, has become a topic of both research and practical significance. Web information extraction (Information Extraction) emerged as an active research direction to meet these needs and is widely used in business data mining, social network analysis, vertical search engines, and other fields. Web information extraction means extracting data from semi-structured or unstructured Web pages and converting it into structured data for mining and use.

At present, most Web pages on the Internet are generated dynamically from templates. Extraction methods based on machine learning and on statistics depend too heavily on the quality and quantity of the corpus and do not make full use of the template and structural features of Web pages. The uneven distribution of data across languages also affects the accuracy of Web information extraction to some extent. Therefore, to extract structured information from a set of similar Web pages, one can take full advantage of their structural features, exploit what such pages have in common, find the templates behind them, and use those templates to extract information. Based on an analysis of the background of Web information extraction and the problems of existing extraction algorithms, the main contents of this thesis are as follows:

(1) A framework of information extraction algorithms for Web site data is presented. For the various types of pages on a site, an improved suffix tree structure is used to efficiently identify the repeated records in each page; the DOM (Document Object Model) tree is pruned and the records are merged; pages generated by different templates are then separated by clustering; within each cluster an unsupervised method extracts the corresponding template; and these templates are used to extract the key information.

(2) An improved clustering algorithm based on K-Means is proposed. During clustering, locality-sensitive hashing is used to compute fingerprint information for each class. The fingerprints screen out a small number of candidate classes, the most similar class is then found among the candidates, and pages of the same type are merged incrementally. This quickly separates the different types of Web pages and facilitates the extraction of Web page templates.

(3) A template extraction and matching algorithm that makes full use of the structure of Web pages is proposed. An improved minimum common sub-sequence algorithm is used to extract the Web page template according to the depth information of the DOM tree, and new pages are matched against the template and extracted accordingly.

To verify the validity of the algorithm, the thesis uses data obtained from mainstream Web sites, conducts detailed experiments and analysis on the core modules, and compares the results with some existing extraction algorithms. The results show that the method can automatically find and extract the main information of a page and that extraction accuracy improves significantly.
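The template extraction idea in contribution (3) can be illustrated with a minimal sketch. It is not the thesis's improved algorithm, which the abstract does not specify. It assumes the textbook longest common subsequence in its place, applied to depth-tagged DOM tokens, and every name in it (`DepthTokenizer`, `tokenize`, `lcs`, `extract`, the sample pages) is hypothetical.

```python
from html.parser import HTMLParser

# Tags that never enclose children, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class DepthTokenizer(HTMLParser):
    """Flatten a page into (depth, kind, value) tokens in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append((self.depth, "tag", tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append((self.depth, "text", text))


def tokenize(html):
    parser = DepthTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def lcs(a, b):
    """Standard dynamic-programming longest common subsequence (an assumption,
    standing in for the improved algorithm the abstract mentions)."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table forward to recover one common subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def extract(template, html):
    """Greedily align a page with the template; text tokens that the
    template does not account for are returned as extracted data."""
    data, t = [], 0
    for token in tokenize(html):
        if t < len(template) and token == template[t]:
            t += 1
        elif token[1] == "text":
            data.append(token[2])
    return data


# Two made-up pages assumed to come from the same template.
page_a = "<html><body><div><h1>Shop</h1><p>Red kettle</p><span>Price</span><b>19</b></div></body></html>"
page_b = "<html><body><div><h1>Shop</h1><p>Blue mug</p><span>Price</span><b>7</b></div></body></html>"
template = lcs(tokenize(page_a), tokenize(page_b))

page_c = "<html><body><div><h1>Shop</h1><p>Green lamp</p><span>Price</span><b>42</b></div></body></html>"
print(extract(template, page_c))
```

Keeping the depth in each token means that two tags with the same name at different levels of the DOM tree are not treated as matching, which is one plausible reading of how depth information constrains the template.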
Keywords/Search Tags:Web page, Information extraction, Web page template, Similarity calculation, Incremental clustering