The Research Of Webpage Denoising Method Based On Classification Technology

Posted on: 2016-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: X J Li
Full Text: PDF
GTID: 2308330479493286
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet has led to rapid growth in all kinds of online information. Extracting useful information from text on the Internet and analyzing the correlations between pages are challenges in the field of natural language processing. A webpage consists of navigation, subject information, hyperlinks, and copyright information. Everything other than the page's text information is noise, and this noise degrades the results of tasks such as web information retrieval and webpage classification.
To apply natural language processing technology more effectively to web information, improve the ability to process web text, and reduce the dependence of webpage denoising on page templates, we propose a denoising method that combines the tag location features and the text features of a webpage. The method maps an HTML page to its corresponding DOM tree and, based on an analysis of the tree structure, extracts the location features and text features of each tag node. It also considers the semantic similarity between the text and the title, and assigns the block to which each DOM tree node belongs to one of two classes: text or noise. Finally, these data are expressed as samples and used in classification experiments with machine learning classifiers. The method is simple, depends little on web templates, and has a certain degree of generality.
The experiments used Decision Tree, Naïve Bayes, and Support Vector Machine classifiers. Comparative experiments verified the effectiveness of the method and achieved relatively high accuracy, indicating that the method can extract text information and remove noise from a webpage fairly accurately. We then analyzed the experimental results in detail and summarized the causes of misclassification.
We also ran a feature selection experiment for comparison. It showed the relationship between the contribution of each selected feature to the results and its time complexity, and demonstrated that feature selection is important for achieving both high accuracy and efficiency.
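The feature extraction step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not list the exact features, so the ones below (tree depth, document order, text length, link-text ratio) are assumptions chosen for illustration, and a simple word-overlap (Jaccard) score stands in for the semantic similarity between a block's text and the title. The names `DomBuilder`, `extract_features`, and `BLOCK_TAGS` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets for this sketch; the thesis may use different ones.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "ol", "table", "span"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = []  # text found directly inside this tag


class DomBuilder(HTMLParser):
    """Builds a simple DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void tags never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest matching open tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.own_text.append(data.strip())


def collect_text(node, links_only=False, inside_link=False):
    """Text under a node; with links_only, only text inside <a> tags."""
    if node.tag in SKIP_TAGS:
        return ""
    inside = inside_link or node.tag == "a"
    parts = list(node.own_text) if (inside or not links_only) else []
    for child in node.children:
        parts.append(collect_text(child, links_only, inside))
    return " ".join(p for p in parts if p)


def find(node, tag):
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def jaccard(a, b):
    """Word-overlap score, a stand-in for semantic similarity."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def extract_features(html):
    """Returns one feature dict per non-empty block node, in document order."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    title_node = find(builder.root, "title")
    title = collect_text(title_node) if title_node is not None else ""
    rows = []

    def walk(node, depth):
        if node.tag in BLOCK_TAGS:
            text = collect_text(node)
            if text:
                rows.append({
                    "tag": node.tag,
                    "depth": depth,                 # location feature
                    "order": len(rows),             # location feature
                    "text_len": len(text),          # text feature
                    "link_ratio": len(collect_text(node, True)) / len(text),
                    "title_sim": jaccard(text, title),
                })
        for child in node.children:
            walk(child, depth + 1)

    walk(builder.root, 0)
    return rows
```

Each returned dictionary corresponds to one sample. After manual labeling as text or noise, such samples could be passed to any classifier, for example the Decision Tree, Naïve Bayes, or Support Vector Machine methods named in the abstract.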
Keywords/Search Tags: Natural Language Processing, Webpage Denoising, Text, Noise, Machine Learning