Research on Automatic Text Classification Methods Based on Machine Learning

Posted on: 2021-01-23  Degree: Master  Type: Thesis
Country: China  Candidate: S Wang
GTID: 2428330620964235  Subject: Control engineering
Abstract/Summary:
The classification of Chinese text documents has long been regarded as an important research topic in natural language processing (NLP) and machine learning (ML). Raw data in databases keeps accumulating, and the number of Chinese documents grows rapidly every day. Most existing Chinese text classification techniques either lack a comprehensive feature selection method or rely on a classification index that is too one-sided. Improving these algorithms is therefore one way to make Chinese text classification more practical and effective. This thesis focuses on the key links in a text classification system, with research carried out on two aspects: feature dimensionality reduction and the classifier algorithm.

First, for feature dimensionality reduction, three improved feature selection methods are proposed. (1) The chi-square (CHI) statistic does not adequately account for term frequency and the cross-correlation between categories, and the mutual information (MI) algorithm ignores the frequency of terms in the text, so both tend to select low-frequency feature words. By introducing a word frequency factor and a term adjustment factor at the same time, this thesis proposes an improved algorithm, CHMI, that outperforms both chi-square statistics and mutual information. (2) The TF-IDF weighting method is widely used for feature dimensionality reduction and feature word selection, but it ignores the distribution relationship between feature words in its calculation. This thesis combines TF-IDF with the chi-square statistic to form a more comprehensive measure of term importance, the TF-CHI feature selection algorithm. (3) As a machine learning algorithm, XGBoost can solve multi-class classification problems and can produce more accurate classification results. In Chinese text classification, however, XGBoost is inefficient and has difficulty handling high-dimensional feature words. This thesis proposes using the TF-IDF algorithm to pre-filter feature words during classification, yielding an improved feature selection method that combines XGBoost with TF-IDF.

Second, for the classification algorithm, results based on the support vector machine often show a strong tension between learning ability and generalization ability. This thesis first divides kernel functions into global kernels and local kernels. Because global kernels generally have weak learning ability and local kernels have weak generalization ability, it proposes a mixed kernel function built from a global kernel and a local kernel, specifically a linear combination of a linear kernel and a Gaussian kernel.

The fifth chapter of the thesis is the experimental part. Experiments are designed and carried out separately for each of the improvements above. They confirm the effectiveness of the three improved feature selection algorithms and show that the mixed kernel function classifies better, in terms of both learning ability and generalization ability, than a single kernel function.
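The mixed kernel described above can be illustrated with a minimal sketch. The abstract does not state how the two kernels are weighted, so the convex combination `weight * linear + (1 - weight) * gaussian`, the parameter names `weight` and `gamma`, and the helper `gram_matrix` are all assumptions made for illustration, not the thesis's actual formulation.

```python
import math


def linear_kernel(x, y):
    """Global kernel: plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, gamma=0.5):
    """Local kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mixed_kernel(x, y, weight=0.5, gamma=0.5):
    """Linear combination of a linear and a Gaussian kernel.

    `weight` in [0, 1] is an assumed mixing coefficient: 1.0 gives the
    pure linear kernel, 0.0 gives the pure Gaussian kernel.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * linear_kernel(x, y)
            + (1.0 - weight) * gaussian_kernel(x, y, gamma))


def gram_matrix(samples, weight=0.5, gamma=0.5):
    """Pairwise mixed-kernel values for a list of feature vectors."""
    return [[mixed_kernel(a, b, weight, gamma) for b in samples]
            for a in samples]
```

Because a non-negative weighted sum of two valid kernels is itself a valid kernel, a Gram matrix built this way could be handed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.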
Keywords/Search Tags: text classification, machine learning, feature selection, support vector machine, mixed kernel function