
Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree

Posted on: 2020-08-28
Degree: Master
Type: Thesis
Country: China
Candidate: R Q Jiang
GTID: 2428330596478817
Subject: Computer technology
Abstract/Summary:
With the continuous development of network technology, online platforms have become one of the main ways people obtain information. Web pages are now awash with news of every kind, and they contain a great deal of noise that is unrelated to the theme of the page. This makes information integration and retrieval increasingly difficult, and it makes information extraction an increasingly meaningful research topic.

During my Master's study I took part in a project on agricultural commodity filing and trading, which required extracting the body information from major agricultural websites, both domestic and foreign, for database analysis and querying. Because the web pages are numerous and of many different types, existing extraction techniques with high precision and recall tend to have poor applicability, while widely applicable techniques do not achieve high precision and recall. How to balance these two aspects is the key point of this research. Since the project concerns agricultural websites at home and abroad, the data set consists of 1000 web pages selected at random from 5 different types of agricultural websites. The body text of these pages is extracted based on the DOM tree structure, with the aim of improving the versatility and adaptability of body text extraction while also improving its recall and precision. The main research results are as follows:

(1) The page is partitioned by the node similarity of the DOM tree. To address the poor applicability of existing page blocking methods, a page partitioning scheme based on the node path similarity of the DOM tree is proposed. First, the web page is represented in structured form as a DOM tree, and the node path of each leaf node of the tree is defined as an n-tuple. The similarity between paths is then calculated by the node similarity algorithm, a threshold is set, and nodes whose similarity exceeds the threshold are merged. Because node labels and their attributes are not considered, the scheme is widely applicable, and the experiments confirm this.

(2) The text value of each block is judged after the page is partitioned. To address the problem that body information blocks and noise information blocks cannot be told apart after partitioning, the existing body information extraction schemes based on DOM tree density and on node importance are analyzed. The characteristics of the node path density model are combined with the calculation of node importance, and a threshold is used to distinguish body information blocks from noise information blocks. Experiments show that, compared with other methods, this method achieves better recall and precision in extraction.

(3) An improved method based on classifier-based adaptive threshold selection. There is no clear standard or established way to improve the threshold setting, and the actual effect of a manually set threshold was not good. An improved method based on classifier-based adaptive threshold selection is therefore proposed. A classifier is trained on the density value and the node importance degree, and the trained result is used to determine which blocks are body blocks and which are noise blocks. The experiments show that using the classifier improves recall and precision.

(4) Denoising the information blocks and extracting the body information. The body information blocks contain a small amount of noise, and the noise information blocks contain a small amount of body information. To address this, the centrality and continuity of the text information are used to perform structural denoising of the nodes within each block. The nodes in the body information blocks are then ordered and combined according to their node path numbers to extract the complete body information.
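The partitioning idea in result (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual path definition or similarity formula, so everything here is an assumption made for illustration: a leaf path is taken to be the tuple of child indices from the root to the element holding the text (tag names and attributes are ignored, in line with the abstract), similarity is the shared-prefix length divided by the longer path length, and consecutive leaves are merged when their similarity reaches the threshold. The names `LeafPathCollector`, `leaf_paths`, `path_similarity`, and `partition` are hypothetical, and the parser assumes well-formed HTML with matching end tags.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they count as children but open no level.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LeafPathCollector(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of child indices (assumed definition)."""

    def __init__(self):
        super().__init__()
        self.path = []            # child indices from the root to the open element
        self.child_counts = [0]   # children seen so far at each open level
        self.leaves = []          # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        if tag in VOID_TAGS:
            return
        self.path.append(index)
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.path:
            return
        self.path.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((tuple(self.path), text))


def leaf_paths(html):
    """Parse an HTML string and return its text leaves with their paths."""
    collector = LeafPathCollector()
    collector.feed(html)
    return collector.leaves


def path_similarity(a, b):
    """Shared-prefix length over the longer path length (an assumed measure)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / longest


def partition(leaves, threshold):
    """Merge consecutive leaves whose path similarity reaches the threshold."""
    blocks = []
    for path, text in leaves:
        if blocks and path_similarity(blocks[-1][-1][0], path) >= threshold:
            blocks[-1].append((path, text))
        else:
            blocks.append([(path, text)])
    return blocks
```

Calling `partition(leaf_paths(page_html), 0.7)` on a page groups sibling paragraphs under one container into a single block, while text that sits in a structurally distant subtree, such as a navigation list, starts a new block. The threshold value 0.7 is arbitrary here; the thesis itself replaces hand-set thresholds with a classifier in result (3).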
Keywords/Search Tags: Information extraction, DOM tree structure, Page block, Classifier