Research On Text Classification Based On Machine Learning

Posted on: 2018-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Sun
GTID: 2348330563952610
Subject: Computer technology
Abstract/Summary:
With the explosive growth of the Internet, people have more options and more convenient ways to obtain information, but this also brings challenges. Faced with a vast and dazzling variety of online articles, how to find the needed information quickly and efficiently has become a hot topic. Classifying large volumes of web text is therefore necessary. Text classification makes it quicker and more convenient for users to browse the web, find information, and acquire knowledge, and it is already used in digital libraries, search engines, and other fields.

Automatic text classification is based on machine learning, which is divided into supervised and unsupervised methods. Research on text classification has accordingly proposed many machine learning models, along with improved algorithms built on the basic models, for both settings. Beyond algorithmic improvements, the most important development in text classification is the shift from simple keyword counting to understanding the semantics and context of the text; many methods that exploit document structure and word position have also been proposed. Because of the particular nature of the Chinese language, text classification in Chinese differs from that in other languages: polysemy and synonymy in Chinese make feature extraction difficult.

The classification in this thesis uses Latent Dirichlet Allocation (LDA), a generative model. A topic model is influenced not only by the number of words in a document but also by factors such as word position. It assumes that an article, regardless of its length, revolves around one or more topics, so extracting an article's topics is enough to indicate its category. This thesis divides web page text into two categories: link text and social text. For link text, it extends the LDA model with a calculation of the link relations between documents to improve classification accuracy. For social text, which offers little extractable content and is colloquial and everyday in style, the LDA model is first used to generate topic distributions, and a decision tree model then performs the classification.

The main work of the thesis is as follows:

(1) The general process of text classification is studied, including text preprocessing, word segmentation, and denoising. The main methods and research status of text classification are analyzed, and topic models are studied in depth and drawn upon.

(2) The link relationship model (LRM) is proposed to handle link text. LRM adds a calculation of link relations to the LDA model to improve accuracy, and it estimates the parameters using a variational distribution and maximum likelihood estimation, which limits the increase in computation.

(3) The LDA-ID3 hybrid model is proposed for social text. Social text contains rare words, many miswritten characters, and few usable features, so traditional classification methods no longer apply. Topic models have a natural advantage on short text, but a generative model alone gives unsatisfactory classification results. The thesis therefore combines the LDA topic model with the ID3 decision tree classification algorithm to bring the strengths of both to social text.

Finally, experimental analysis shows that adding structural calculations to the model, or using a hybrid model, increases the amount of computation. With corresponding control measures, however, the classifier's performance and classification results improve to a certain degree, at some cost in efficiency...
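The second stage of the LDA-ID3 idea described above (topic distributions fed to an ID3 decision tree) can be illustrated with a minimal sketch. The abstract does not say how topic distributions are turned into decision tree features, so everything here is an assumption: the topic distributions are hard-coded hypothetical values standing in for LDA output, the 0.5 threshold used to discretize them is arbitrary, and the function names are illustrative only. The sketch shows only the core ID3 step, choosing the feature with the highest information gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one discrete feature."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def discretize(topic_distribution, threshold=0.5):
    """Assumed feature construction: is each topic dominant in the document?"""
    return tuple(p >= threshold for p in topic_distribution)


def best_split(rows, labels):
    """ID3 criterion: index of the feature with the highest information gain."""
    return max(range(len(rows[0])),
               key=lambda i: information_gain(rows, labels, i))


# Hypothetical per-document topic distributions (stand-ins for LDA output).
doc_topics = [
    (0.8, 0.1, 0.1),
    (0.7, 0.2, 0.1),
    (0.1, 0.3, 0.6),
    (0.2, 0.2, 0.6),
]
labels = ["sports", "sports", "finance", "finance"]

rows = [discretize(d) for d in doc_topics]
root_feature = best_split(rows, labels)
print("split on topic", root_feature,
      "gain =", information_gain(rows, labels, root_feature))
```

A full ID3 tree would apply `best_split` recursively to each partition until the labels in a node are pure or no features remain. In practice the topic distributions would come from a trained LDA model rather than fixed values.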
Keywords/Search Tags:Text Classification, LDA Topic Model, Decision Tree Model, LRM, LDA-ID3