Design And Research Of The Web Classification Scheme Based On SVM

Posted on: 2015-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: L Ye
Full Text: PDF
GTID: 2298330467463897
Subject: Cryptography
Abstract/Summary:
In recent years the web has quickly become the world's largest public data source, and locating information quickly and accurately within it has become an urgent problem. The core of that problem is automatic web page classification. Web text classification, which derives from web page classification, is a major component of text mining. Web classification sorts pages by topic, stores the results in a database, and produces classified information resources. On the one hand, classified information directories and hierarchical management improve users' search efficiency and let them locate target pages quickly and accurately. On the other hand, classification can also support customization, such as filtering undesirable web pages or selecting pages by features of interest. The current mainstream approach to web classification is therefore text classification: with a sound design of the web page representation and the text classification algorithm, automatic classification can be achieved.

There are many automatic web text classification algorithms, and the support vector machine (SVM) is among the most popular and best performing. In this thesis we design a complete web classification scheme based on SVM and implement an automatic web classification system. We run experiments on multiple samples to test the system, and we evaluate the results to verify the feasibility of the scheme.

The work completed in this study is as follows.

First, we analyze and summarize the subject of web classification, its background and tasks, and the structure of the thesis.

Second, we systematically analyze the key technologies and related theory, including data acquisition, data preprocessing, and SVM classifiers. Data preprocessing consists of web page noise removal, text segmentation, feature selection, and feature preprocessing. SVM is adopted as the classification algorithm after comparing the performance of SVM and KNN.

Third, we design the SVM-based web classification scheme in detail, covering architecture design and detailed design. The architecture design takes the web classification process as its basis and covers requirements analysis, goals, the development environment, and the overall design. The detailed design follows a modular partition: the system is divided into a database module, a user interaction module, and a classification module, and each module is designed in detail.

Fourth, we present experiments on web pages and, based on the experimental results, analyze and optimize the system's performance.

Fifth, the innovation of this thesis lies in the text preprocessing stage. To improve accuracy for high-priority categories such as pornography, violence, gambling, and drugs, a keyword-matching step is added before word segmentation of the page content. First, corpus material for pornography, violence, drugs, and other such categories is collected from pages whose URLs are known to belong to the corresponding category. These pages are parsed, their titles are extracted and segmented, word frequencies are computed and sorted in descending order, and the top-ranked words are selected to form a preset keyword table. Then the pages in the training and prediction samples are parsed, and their title keywords are extracted and compared against the preset keyword table. A page that matches is assigned the corresponding category label. A page that does not match proceeds to segmentation of the page content, feature extraction, and SVM classification, which yields the classification result.

Finally, the author summarizes the main results and work of the thesis and discusses future directions.
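The keyword pre-filter described in the fifth point can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the seed titles, the `top_k` cutoff, and the regex tokenizer are all hypothetical, and the `fallback` callable merely stands in for the segmentation, feature extraction, and SVM stage. A real system would use a proper word segmenter and stop-word removal so that common words do not enter the keyword table.

```python
import re
from collections import Counter

# Hypothetical seed titles taken from pages whose category is already known.
SEED_TITLES = {
    "gambling": ["casino poker bonus", "poker casino jackpot", "casino slots"],
    "drugs": ["narcotics pills", "narcotics dealer", "cheap pills narcotics"],
}

def tokenize(title):
    """Stand-in for a real word segmenter: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())

def build_keyword_table(titles_by_category, top_k=2):
    """Count title words per category and keep the top_k most frequent."""
    table = {}
    for category, titles in titles_by_category.items():
        counts = Counter(tok for title in titles for tok in tokenize(title))
        table[category] = {word for word, _ in counts.most_common(top_k)}
    return table

def prefilter(title, table):
    """Return the first category whose keywords appear in the title, else None."""
    tokens = set(tokenize(title))
    for category, keywords in table.items():
        if tokens & keywords:
            return category
    return None

def classify(title, body, table, fallback):
    """Try the keyword table first; otherwise defer to the fallback classifier."""
    label = prefilter(title, table)
    if label is not None:
        return label
    # Assumption: `fallback` wraps segmentation, feature extraction, and SVM.
    return fallback(body)
```

Calling `classify(title, body, build_keyword_table(SEED_TITLES), fallback)` labels a page from its title alone when a keyword matches and only runs the more expensive classifier otherwise, which is the ordering the abstract describes.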
Keywords/Search Tags: web page classification, text classification, web page purification, feature selection, support vector machine (SVM)