The Research Of Webpage Denoising Method Based On Classification Technology

Posted on:2016-02-11

Degree:Master

Type:Thesis

Country:China

Candidate:X J Li

Full Text:PDF

GTID:2308330479493286

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, the volume of information on the network is growing quickly. How to extract useful information from texts on the Internet, and how to analyze the correlation between pages, are challenges in the field of natural language processing. A webpage consists of navigation, main text content, hyperlinks, and copyright information. Everything other than the main text is noise, and this noise degrades the results of web information retrieval, web page classification, and similar tasks.

To apply natural language processing technology more effectively to web information, improve the handling of web text, and reduce the dependency of webpage denoising on page templates, we propose a denoising method based on the combination of tag location features and text features of a webpage. The method maps the HTML webpage to its corresponding DOM tree and, by analyzing the DOM tree structure, extracts the location features and text features of each tag node. It also considers the semantic similarity between the text and the page title, and labels the block that each DOM tree node belongs to as either main text or noise. Finally, these data are expressed as sample data for classification experiments with machine learning classification methods. The method is simple, depends little on web templates, and is reasonably general.

The experiments used Decision Tree, Naïve Bayes, and Support Vector Machine classifiers. Comparative experiments verified the effectiveness of the method and achieved relatively high accuracy, indicating that the method can accurately extract the main text and remove noise from a webpage. We then analyzed the experimental results in detail and summarized the causes of misclassification.

We also ran a feature selection experiment for comparative analysis. It showed the relationship between the contribution of the selected features to the results and their time complexity, demonstrating that feature selection is important for both accuracy and efficiency.
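
As an illustration only, the feature extraction step described in the abstract might be sketched as follows in Python. This is not the thesis's implementation: the abstract does not list the exact features, so the ones used here (tag depth, relative position, link membership, text length) are assumptions, and word-overlap (Jaccard) similarity stands in for the unspecified semantic similarity between text and title. The names `BlockFeatureExtractor`, `extract_features`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser
import re

# Tags that never have a closing tag, and tags whose content is not page text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


def tokens(text):
    """Lower-cased set of word tokens in a string."""
    return set(re.findall(r"\w+", text.lower()))


class BlockFeatureExtractor(HTMLParser):
    """Walks the HTML and records one entry per text-bearing node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags from the root down to the current node
        self.title = ""    # text found inside <title>
        self.blocks = []   # raw per-node records

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        if "title" in self.stack:
            self.title += text
            return
        self.blocks.append({
            "text": text,
            "depth": len(self.stack),      # location feature (assumed)
            "in_link": "a" in self.stack,  # location feature (assumed)
        })


def extract_features(html):
    """Return one feature row per text node; the features are illustrative."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    title_tokens = tokens(parser.title)
    total = len(parser.blocks)
    rows = []
    for index, block in enumerate(parser.blocks):
        block_tokens = tokens(block["text"])
        union = block_tokens | title_tokens
        rows.append({
            "text": block["text"],
            "depth": block["depth"],
            "position": index / max(total - 1, 1),  # 0.0 = first, 1.0 = last
            "in_link": block["in_link"],
            "length": len(block["text"]),           # text feature (assumed)
            # Jaccard word overlap as a stand-in for semantic similarity.
            "title_similarity":
                len(block_tokens & title_tokens) / len(union) if union else 0.0,
        })
    return rows


SAMPLE = """<html><head><title>Webpage denoising with classifiers</title></head>
<body>
<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
<div id="main"><p>Webpage denoising removes navigation and copyright noise before classifiers see the text.</p></div>
<div id="footer">Copyright 2016</div>
</body></html>"""

if __name__ == "__main__":
    for row in extract_features(SAMPLE):
        print(row)
```

Each row could then be paired with a manual label (main text or noise) and passed to a classifier such as a decision tree, Naïve Bayes, or a support vector machine, as in the experiments summarized above.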

Keywords/Search Tags:

Natural Language Processing, Webpage Denoising, Text, Noise, Machine Learning

Related items

1	Intelligent Device Text Classification Method Based On Natural Language Processing
2	Research On Text Classification Based On Natural Language Processing And Machine Learning
3	Research And Application Of Text Classification Based On Natural Language Processing
4	Research On Machine Learning For Natural Language Processing And Transmission
5	Research On The Construction And Analysis Of Common Sense Corpora For Natural Language Generation
6	Research On E-Commerce Commodity Title Category Classification Algorithm Based On Natural Language Processing Technology
7	A Study On Attacks And Defenses For Machine Learning Models With Text And Log Data
8	Text Filtering Key Technologies
9	The Design And Implementation Of Hidden Hazard Analysis System Based On Natural Language Processing
10	Research On Internet Spam Identification Method