
The Implementation Of Large-Scale Chinese Website Classification System Based On Improved SVM Algorithms

Posted on: 2020-05-10 | Degree: Master | Type: Thesis
Country: China | Candidate: T S Zhang
GTID: 2428330572973600 | Subject: Computer technology
Abstract/Summary:
With the development of the Internet, demand for website retrieval keeps growing, and classifying websites can greatly improve retrieval efficiency. Studying automatic classification of Chinese websites is therefore of considerable value. A Support Vector Machine (SVM) algorithm based on the keyword frequency of web pages is flexible and convenient, generalizes well when learning Chinese keywords in web pages, and can support large-scale classification of Chinese websites. SVM has already been applied to website classification, but because older classification methods are inefficient, have low accuracy, and support Chinese poorly, they cannot cover the millions of Chinese websites in China. To address these problems, this thesis improves the SVM model for large-scale Chinese website classification and, on that basis, implements a classification system for domestic Chinese websites that handles data sets of millions of sites.

This thesis examines the classification accuracy achieved on millions of websites and studies how the relevant SVM parameters affect that accuracy. To address the effect of unbalanced sample sets on classification accuracy in the traditional SVM algorithm, it adjusts the model by modifying the hyperplane. A newly introduced control parameter moves the hyperplane toward the positive-class samples, leaving more room for the negative-class samples and thereby reducing the influence of the unbalanced sample set on classification accuracy. The experimental results show that, when the improved algorithm is applied to an unbalanced sample set, the classification accuracy of negative samples improves markedly within a certain range of the parameter, which improves the overall classification result.

Based on the improved SVM algorithm, a large-scale Chinese website classification system is built to classify Chinese websites in China on the order of millions. The system comprises five basic modules: data acquisition; data processing and storage; data calculation; data classification; and result display and query. It integrates the whole workflow, from website crawling, information storage, and data preprocessing, through algorithm testing and optimization, to algorithm application and result display. Testing shows that the system meets the application requirements and achieves good classification results.
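
The abstract does not give the formula behind the hyperplane adjustment, so the following is only a minimal sketch of one way such a control parameter could work, not the thesis's actual method. It assumes a linear SVM with decision function f(x) = w·x + b and a hypothetical offset `delta`: a sample is labeled positive only when f(x) > delta, which moves the boundary toward the positive class by delta/||w|| and enlarges the region assigned to the negative class. The toy data, the plain subgradient trainer, and all function names are illustrative assumptions.

```python
import random


def make_imbalanced_data(n_pos=200, n_neg=20, seed=0):
    """Toy 2-D data: many positives near (2, 2), few negatives near (0, 0)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_pos):
        X.append([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)])
        y.append(1)
    for _ in range(n_neg):
        X.append([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        y.append(-1)
    return X, y


def decision(w, b, x):
    """Linear decision value f(x) = w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=50, seed=0):
    """Stochastic subgradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * decision(w, b, X[i]) < 1
            # The regularization term shrinks w on every step.
            w = [wj - eta * lam * wj for wj in w]
            if violated:
                # Hinge-loss subgradient step for a margin violation.
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x, delta=0.0):
    """Label +1 only if f(x) > delta (assumed offset); delta=0 is the usual rule."""
    return 1 if decision(w, b, x) > delta else -1


def negative_recall(w, b, X, y, delta=0.0):
    """Fraction of true negative samples that are predicted negative."""
    negatives = [x for x, label in zip(X, y) if label == -1]
    hits = sum(1 for x in negatives if predict(w, b, x, delta) == -1)
    return hits / len(negatives)


if __name__ == "__main__":
    X, y = make_imbalanced_data()
    w, b = train_linear_svm(X, y)
    for delta in (0.0, 0.5, 1.0):
        print(delta, negative_recall(w, b, X, y, delta))
```

In this sketch, raising `delta` can never lower the recall on negative samples, but it can cost accuracy on positive samples, which is consistent with the abstract's remark that the improvement holds only within a certain parameter range.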
Keywords/Search Tags: support vector machine, text segmentation, imbalanced sample, optimal hyperplane, multi-classification