
Research Of Web Information Extraction Method Based On Multi-feature Mining

Posted on: 2019-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Gao
Full Text: PDF
GTID: 2428330566499007
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, numerous new Web pages are created every day, most of them news and blog pages. With the rise of self-media, the proportion of news and blog pages is increasing year by year. Many Web information extraction algorithms have been proposed to capture useful information and remove noise from Web pages automatically. Such algorithms mainly rely on either the textual statistical features or the structural features of a Web page, which keeps extraction simple and efficient. However, they do not make full use of the information in a Web page, so their extraction performance is unstable. This thesis therefore presents a new Web information extraction algorithm based on multi-feature mining. The algorithm combines the textual statistical features, structural features and visual information features of a Web page, and selects the nodes that hold the main content with a precision and recall based CSS DOM node choice algorithm.

For the CSS DOM node classification problem, the author proposes a CSS DOM node classification algorithm based on multi-feature mining. The algorithm first extracts textual statistical features, structural features and visual information features from CSS DOM nodes. It then trains classification models on the labeled feature data with machine learning classification algorithms. Finally, it extracts the same features from the CSS DOM nodes of new Web pages and classifies them with the trained models. Experimental results show that the algorithm can separate main content nodes from noise content nodes on new Web pages. They also show that using multiple features together works better than using textual statistical features or visual information features alone.

For the problem of identifying the main content of a Web page, the author proposes a precision and recall based CSS DOM node choice algorithm. Together with the CSS DOM node classification algorithm based on multi-feature mining, it forms the Web information extraction algorithm based on multi-feature mining. Experimental results show that the combined algorithm extracts content effectively and is more robust than existing algorithms.
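
The abstract does not spell out how the precision and recall based node choice works, so the following is only a minimal sketch under stated assumptions. It assumes each DOM node already carries a main/noise label from the classifier, measures precision and recall in text characters, and picks the subtree with the highest F1 score. The `Node` class, the helper names and the F1 criterion are all hypothetical and not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a CSS DOM node (not the thesis's data structure)."""
    name: str
    text_len: int = 0          # characters of text held directly by this node
    is_main: bool = False      # label assumed to come from the node classifier
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node itself, then all of its descendants (pre-order)."""
    yield node
    for child in node.children:
        yield from walk(child)


def subtree_counts(node: Node) -> Tuple[int, int]:
    """Return (main_chars, total_chars) for the subtree rooted at node."""
    main = sum(n.text_len for n in walk(node) if n.is_main)
    total = sum(n.text_len for n in walk(node))
    return main, total


def choose_node(root: Node) -> Optional[Node]:
    """Pick the subtree with the best F1 of estimated precision and recall.

    Assumption: precision = main chars in subtree / all chars in subtree,
    recall = main chars in subtree / main chars on the whole page.
    """
    page_main, _ = subtree_counts(root)
    if page_main == 0:
        return None  # the classifier found no main content at all
    best, best_f1 = None, -1.0
    for cand in walk(root):
        main, total = subtree_counts(cand)
        if main == 0:
            continue  # a subtree without main content cannot be the answer
        precision = main / total
        recall = main / page_main
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best, best_f1 = cand, f1
    return best


# Toy page: navigation and footer are noise, the article holds the main text.
page = Node("html", children=[
    Node("nav", text_len=200),
    Node("article", children=[
        Node("p1", text_len=500, is_main=True),
        Node("p2", text_len=400, is_main=True),
        Node("ad", text_len=50),
    ]),
    Node("footer", text_len=100),
])

print(choose_node(page).name)  # prints "article"
```

In this toy page the whole document has perfect recall but low precision, and each paragraph has perfect precision but low recall, so the `article` subtree wins. Recomputing the counts for every candidate keeps the sketch short; a real implementation would likely cache them in one bottom-up pass.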
Keywords/Search Tags: Web information extraction, visual features of Web pages, textual features of Web pages