| In the Internet age, people have become accustomed to using online media to express and spread their emotions, attitudes, and views on their own social lives and on trending online events. This information continues to exert influence across the network and forms what is commonly called "network public opinion". It is an important channel through which government departments understand public sentiment, and how to obtain this public opinion information is the focus of this thesis. The accuracy and timeliness of information acquisition directly affect the sourcing, judgment, and analysis of network public opinion. Traditional public opinion collection relies mostly on template-based extraction. However, with the development of HTML5 and CSS3, manually customized templates can no longer keep up with ever-changing pages. How to make a program automatically and accurately extract structured target data from ever-changing pages has therefore become a problem that public opinion systems need to solve. This thesis takes adaptive Web information extraction as its research direction. By analyzing the page forms of current mainstream websites and taking into account the characteristics of public opinion data, it proposes an adaptive Web information extraction method based on single DOM tree feature pre-classification. The main research contents are: 1. An in-depth study of page characteristics and the construction of a feature set for identifying the types of Internet public opinion carriers. A machine learning classification algorithm is introduced to pre-classify the information hyperlinks in the upper-level page, filtering the URLs of the needed information out of complex navigation pages and excluding invalid links and advertisement links. 2. A study of the structural characteristics of pages of the same type, and a proposed information extraction method for homologous web pages. The DOM trees of two homologous pages are compared to identify the information data in the page and generate an extraction template, which is then used to automatically extract information from the other pages of the site. 3. Design and implementation of the above scheme, with tests of the pre-classification results and the page information extraction rate. The innovations of the thesis are: 1. A machine learning classification algorithm is introduced to pre-classify the information hyperlinks in the upper-level page using page structure features, text features, and hyperlink features. 2. For each page to be extracted, the DOM trees of two homologous pages are compared, and regular expression matching, text features, and similar cues are used to identify the specific information in the page and generate the extraction template. Together these form the proposed adaptive Web information extraction method based on single DOM tree feature pre-classification. |
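
The comparison of two homologous DOM trees described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes well-formed HTML and that homologous pages share exactly the same tag structure, and it omits the regular expression and text-feature cues mentioned in the abstract. The names `PathTextParser`, `text_paths`, `build_template`, `extract`, and the sample pages are hypothetical. Text nodes whose positional path is shared by both pages but whose content differs are treated as data fields, and identical text is treated as boilerplate.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so the path stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Map each text node to a positional path such as 'html[0]/body[0]/h1[0]'.

    Assumption: the input is well-formed (every non-void tag is closed in order).
    """

    def __init__(self):
        super().__init__()
        self.stack = []       # path segments of the currently open elements
        self.counters = [{}]  # per open element: how many children of each tag so far
        self.texts = {}       # path -> concatenated text found directly under it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        seen = self.counters[-1]
        index = seen.get(tag, 0)
        seen[tag] = index + 1
        self.stack.append(f"{tag}[{index}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def text_paths(html):
    """Return a dict of path -> text for one page."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def build_template(html_a, html_b):
    """Paths present in both pages whose text differs are taken as data fields."""
    a, b = text_paths(html_a), text_paths(html_b)
    return sorted(p for p in a.keys() & b.keys() if a[p] != b[p])


def extract(html, template):
    """Apply a template (list of paths) to another page from the same site."""
    texts = text_paths(html)
    return {path: texts.get(path) for path in template}


def page(title, body):
    """Hypothetical news page layout shared by all sample pages."""
    return (
        "<html><body><div>Home / News</div>"
        f"<h1>{title}</h1><p>{body}</p>"
        "<div>Copyright Example Site</div></body></html>"
    )


PAGE_A = page("Flood warning issued", "Heavy rain is expected tonight.")
PAGE_B = page("Bridge reopens", "Traffic resumed on Monday.")
PAGE_C = page("School term begins", "Classes start next week.")

template = build_template(PAGE_A, PAGE_B)
result = extract(PAGE_C, template)
print(template)
print(result)
```

In this sketch the navigation bar and copyright line are dropped because their text is identical in both training pages, while the title and body paths end up in the template. Real pages rarely align this cleanly, which is presumably why the abstract pairs the tree comparison with regular expression matching and text features.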