
Research On Text Classification Algorithms Based On Machine Learning

Posted on: 2020-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: S C Ren
Full Text: PDF
GTID: 2438330620955591
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of Internet technology, text data has grown to enormous volume and dimensionality, and valuable information can no longer be extracted from it effectively by traditional manual classification alone. In recent years, as machine learning has matured, classifying text automatically with machine learning algorithms has become an active and challenging research topic. As a major branch of data mining, text classification technology can meet the demand for automatic categorization created by big data. Classifying text efficiently and accurately at this scale is nevertheless not as simple as it may appear: it generally requires text preprocessing, feature selection, feature weighting and the choice of a classification algorithm, and the algorithms used at each of these stages still have shortcomings. Among them, feature selection and the classification algorithm are the two most critical parts of the process and directly determine the final performance of the classifier. This thesis therefore studies these two parts separately; the main work is as follows:

1. The complete text classification process is studied in depth. Because feature selection is particularly important within this process, the advantages and disadvantages of four commonly used feature selection algorithms are summarized and analyzed in detail, and different classifiers are evaluated on the English 20 Newsgroups and Reuters datasets to verify the results. The experiments show that chi-square feature selection extracts features best, so chi-square feature selection is adopted for feature extraction in the rest of the thesis (a minimal illustrative sketch follows the abstract).

2. The traditional TF-IDF algorithm does not consider the information gain that a feature term provides about the categories. To address this, the information gain of feature terms with respect to categories is introduced into TF-IDF, and a naive Bayes classification algorithm based on TF-IDF*IGD weighting is designed (see the second sketch following the abstract for the general idea). First the information entropy of each category is calculated, then the conditional information entropy of each feature term within each category, and the difference between the two gives the information gain of a term for each category; this gain is reflected in the weight so as to improve classification performance. Simulation experiments on the English 20 Newsgroups and Reuters datasets show that the improved algorithm achieves a better macro-F1 value and improves the overall classification performance index by 2%.

3. The TF-IDF*IGD weights still cannot accurately represent a feature's contribution. Starting from a feature's two-dimensional information gain, a naive Bayes classification algorithm based on IGDC weighting is therefore designed, combining the feature-text information gain and the feature-category information gain to measure the weight more accurately. First the feature-category information gain is calculated, then the information gain of the texts containing the feature with respect to the categories; finally the two are multiplied and normalized. Simulation experiments are again carried out on the
English 20 Newsgroups and Reuters datasets, and the results show that the improved algorithm achieves a better macro-F1 value and improves the overall classification performance index by 5%.
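As a companion to item 1, the following minimal sketch shows chi-square feature selection feeding a multinomial naive Bayes classifier on 20 Newsgroups. The use of scikit-learn, its bundled fetch_20newsgroups loader, and the value k=2000 are assumptions made for illustration; the thesis's own preprocessing and parameter choices are not reproduced here.

```python
# Minimal sketch: chi-square feature selection + multinomial naive Bayes on 20 Newsgroups.
# Library choice (scikit-learn) and k=2000 are illustrative assumptions, not the thesis setup.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Bag-of-words -> chi-square feature selection -> multinomial naive Bayes.
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    SelectKBest(chi2, k=2000),   # keep the 2000 highest-scoring terms (illustrative value)
    MultinomialNB(),
)
clf.fit(train.data, train.target)
pred = clf.predict(test.data)
print("macro F1:", f1_score(test.target, pred, average="macro"))
```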
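Items 2 and 3 both rescale TF-IDF weights by an information-gain term before naive Bayes classification. The rough sketch below illustrates that general idea with a simple presence/absence estimate of IG(t) = H(C) - H(C|t) used to scale the feature columns; it is not the thesis's exact TF-IDF*IGD or IGDC formulation, and the normalisation and library choices are assumptions.

```python
# Rough sketch of information-gain-weighted features for naive Bayes
# (illustrative only; not the exact TF-IDF*IGD / IGDC formulas of the thesis).
import numpy as np
from scipy.sparse import diags
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vec = TfidfVectorizer(stop_words="english", max_features=20000)
Xtr, Xte = vec.fit_transform(train.data), vec.transform(test.data)
y = np.asarray(train.target)

def information_gain(X, y, eps=1e-12):
    """IG(t) = H(C) - H(C | t), with t treated as present/absent per document."""
    X = X > 0                                              # term presence indicators
    n_docs, n_feat = X.shape
    classes = np.unique(y)
    p_c = np.array([(y == c).mean() for c in classes])
    ig = np.full(n_feat, -np.sum(p_c * np.log2(p_c)))      # start from H(C)

    df_t = np.asarray(X.sum(axis=0)).ravel()               # docs containing each term
    p_t = df_t / n_docs
    for c in classes:
        df_tc = np.asarray(X[y == c].sum(axis=0)).ravel()
        p_c_t = (df_tc + eps) / (df_t + eps)                              # P(c | t present)
        p_c_nt = ((y == c).sum() - df_tc + eps) / (n_docs - df_t + eps)   # P(c | t absent)
        # subtract this class's share of the conditional entropy H(C | t)
        ig -= -(p_t * p_c_t * np.log2(p_c_t) + (1 - p_t) * p_c_nt * np.log2(p_c_nt))
    return np.clip(ig, 0.0, None)

w = information_gain(Xtr, y)
w = w / (w.max() + 1e-12)                                  # normalise weights to [0, 1]
W = diags(w)                                               # scale each feature column by its gain
clf = MultinomialNB().fit(Xtr @ W, y)
pred = clf.predict(Xte @ W)
print("macro F1:", f1_score(test.target, pred, average="macro"))
```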
Keywords/Search Tags:Text Classification, Feature Selection, Naive Bayes, Two-Dimensional Information Gain, Weighting Algorithm