Research on a Web Information Extraction Method Based on Multi-feature Mining

Posted on:2019-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Gao

GTID:2428330566499007

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, numerous new Web pages are generated every day, most of them news and blog pages. With the rise of self-media, the proportion of news and blog pages is increasing year by year. Many Web information extraction algorithms have been proposed to automatically capture useful information and remove meaningless content from Web pages. Such algorithms mainly rely on either the textual statistical characteristics or the structural characteristics of a Web page, which makes extraction simple and efficient. However, they do not make sufficient use of the information in the page, so their extraction performance is unstable. This thesis therefore presents a new Web information extraction algorithm based on multi-feature mining. The algorithm combines the textual statistical, structural, and visual characteristics of a Web page, and selects the nodes containing the main content through a precision and recall based CSS DOM node choice algorithm.

For the CSS DOM node classification problem, the author proposes a CSS DOM node classification algorithm based on multi-feature mining. The algorithm first extracts textual statistical, structural, and visual features from CSS DOM nodes, then uses machine learning classification algorithms to train classification models on the labeled feature data, and finally extracts the same features from the CSS DOM nodes of new Web pages and classifies them with the trained models. Experimental results show that the algorithm can separate main content nodes from noise content nodes on new Web pages. They also show that using multiple features works better than using textual statistical features or visual features alone.

For the problem of identifying the main content of a Web page, the author proposes a precision and recall based CSS DOM node choice algorithm. Together with the CSS DOM node classification algorithm based on multi-feature mining, it forms the complete Web information extraction algorithm based on multi-feature mining. Experimental results show that the combined algorithm extracts content very effectively and is more robust than existing algorithms.
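The abstract does not spell out how the precision and recall based node choice works, so the following is only a minimal sketch of one plausible reading, not the author's algorithm. It assumes each node already carries a main/noise label from the classifier, measures precision and recall in characters of text, and picks the subtree with the highest F1 score. The `Node` layout, the F1 criterion, and the names `totals` and `choose_node` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A DOM node (hypothetical layout, not taken from the thesis)."""
    name: str
    text_len: int = 0        # characters of text held directly by this node
    is_main: bool = False    # classifier output for this node, assumed given
    children: List["Node"] = field(default_factory=list)


def totals(node: Node) -> Tuple[int, int]:
    """Return (main_chars, all_chars) for the subtree rooted at node."""
    main = node.text_len if node.is_main else 0
    total = node.text_len
    for child in node.children:
        child_main, child_total = totals(child)
        main += child_main
        total += child_total
    return main, total


def choose_node(root: Node) -> Tuple[Optional[Node], float]:
    """Pick the subtree whose text best matches the predicted main content.

    precision = main chars in subtree / all chars in subtree
    recall    = main chars in subtree / main chars on the whole page
    The F1 criterion is an assumption made for this sketch.
    """
    page_main, _ = totals(root)
    best, best_f1 = None, 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        main, total = totals(node)
        if main > 0 and page_main > 0:
            precision = main / total
            recall = main / page_main
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best, best_f1 = node, f1
        stack.extend(node.children)
    return best, best_f1


# Toy page: navigation and footer are noise, the article holds the main text.
page = Node("body", children=[
    Node("div.nav", 40, False),
    Node("div.article", children=[
        Node("p", 300, True),
        Node("p", 200, True),
        Node("div.ad", 50, False),
    ]),
    Node("div.footer", 60, False),
])

best, score = choose_node(page)
print(best.name, round(score, 3))  # prints: div.article 0.952
```

In this toy page the whole `body` has perfect recall but lower precision because of the navigation and footer text, while a single paragraph has perfect precision but low recall. The article container balances the two, which is the intuition behind choosing a node by precision and recall together.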

Keywords/Search Tags:

Web information extraction, visual features of Web pages, textual features of Web pages

Related items

1	Features Extraction And Duplicate Pattern Detection Of Web Pages
2	The Research Of Malicious Web Pages Detection Based On Multiple Features
3	Research And Realization Of Web Information Mining Model Based On Topic Features
4	Research Of Web Information Extraction Based On Features Of Multiple Pages
5	Research On Chinese Blog Pages Recognition And Content Extraction
6	Research On Visual And Textual Images Retrieval Methods Based On Extracting Salient Visual And Textual Features
7	Cultural differences in human-computer interaction: A content analysis and an experiment of design features of organizational home pages
8	Detecting Phishing Web-pages Based On The Spatial Database And Visual Layout Features
9	Research Of Semi-structured Data Extraction Based On Feedback Learning
10	Design And Implementation Of Objectionable Mobile Application Monitoring System Based On Textual Features