A Study Of Subject Web Classification Algorithm Based On Machine Learning

Posted on: 2016-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: B Wang
GTID: 2428330473464979
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet and information technology, digital information of all kinds now fills every corner of society, and text information plays a key role in it. Managing text efficiently has become a hot research topic, and automatic text classification emerged to meet this need. However, the performance of automatic text classification remains relatively low, leaving considerable room for improvement. Text classification is a supervised learning process that involves machine learning, data mining and other key technologies. Many factors affect its performance, including text preprocessing, feature extraction, dimension reduction, text representation, classifier design and evaluation criteria. Because the traditional text representation model is high-dimensional and highly sparse, designing an efficient text representation model and reducing its dimension have become the focus of the field.

Based on an in-depth analysis of automatic web page text classification technology, and taking the characteristics of the food industry into account, this research designs a web theme classification algorithm based on support vector machines and the HR-VSM model. First, the paper reviews the current state of text classification technology at home and abroad, and introduces the definition of text classification, text representation models, the classification process, and the classic machine learning algorithms. It then studies feature extraction methods for web page text. Because terms carry different weight for a page's theme depending on where they appear in the page, the feature extraction and weighting algorithm for web page text is improved accordingly. Drawing on the content of web pages similar to the original page, the paper proposes a novel web text vector model, the HR-VSM model. Next, a web topic classification algorithm is designed on the basis of support vector machines and the HR-VSM model, and its theoretical basis and classification procedure are described.

Finally, the paper verifies the performance of the improved model, the algorithm and a prototype in a simulation experiment on food-related web pages collected from the web: 3994 Chinese documents in total, of which 2794 are used for training and the remaining 1200 for testing. The classification results show that the model effectively improves classification efficiency, which demonstrates that the algorithm is effective. The algorithm can also serve as a useful reference for other industries.
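The idea of weighting terms by their position in a web page can be illustrated with a minimal Python sketch. It is not the thesis's HR-VSM model, which the abstract does not define. The position names, the weights in `POSITION_WEIGHTS`, the smoothed TF-IDF formula and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical position weights; the thesis's actual values are not given.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_term_counts(page):
    """Sum position weights per term; `page` maps position -> list of tokens."""
    counts = Counter()
    for position, tokens in page.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)  # unknown positions count as body
        for token in tokens:
            counts[token] += weight
    return counts


def tfidf_vectors(pages):
    """Return one sparse, L2-normalised {term: weight} vector per page."""
    counts = [weighted_term_counts(page) for page in pages]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each page contributes at most 1 per term
    n = len(pages)
    vectors = []
    for c in counts:
        # Smoothed inverse document frequency (an assumed, common variant).
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy pages, invented for this example.
pages = [
    {"title": ["milk"], "body": ["milk", "bread"]},
    {"title": ["bread"], "body": ["bread", "flour"]},
]
vectors = tfidf_vectors(pages)
```

Vectors built this way could be passed to any linear classifier, such as a support vector machine, for the training and testing step the abstract describes.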
Keywords/Search Tags:Web subject classification, Web text classification, feature selection, support vector machine, vector space model, link analysis