Research on Feature Weighted Multinomial Naive Bayes Algorithms and Applications

Posted on: 2022-10-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S F Ruan
GTID: 1480306563958439
Subject: Earth Exploration and Information Technology
Abstract/Summary:
With the rapid development of science and technology, text data in professional fields is growing explosively, and mining useful information from unstructured text has become a challenge. As a key technology for processing and organizing large volumes of text, text classification is widely used across these fields. Common text classification algorithms include naive Bayes, decision trees, support vector machines, and deep learning. Among them, multinomial naive Bayes is widely used because of its simplicity and efficiency. However, one of its main assumptions is that features are independent of each other given the document class, which rarely holds in practice. At the same time, continually updated text data exhibits new characteristics such as nonlinear structure, class imbalance, and data redundancy, so the traditional algorithm cannot strike an ideal balance between classification accuracy and time complexity.

To address these problems, this dissertation studies multinomial naive Bayes text classification from three aspects: general feature weighting, class-dependent feature weighting, and mixed feature weighting. First, to handle the nonlinear association between features and categories, the distance correlation coefficient is improved by combining inverse document frequency information with the idea of deep feature weighting, and a feature weighted multinomial naive Bayes text classification algorithm based on the improved distance correlation coefficient is proposed. Second, because general feature weighting ignores the differing contributions of a feature to different categories, the chi-square statistic is extended on the basis of class-dependent feature weighting theory and the characteristics of text, and a class-dependent feature weighted multinomial naive Bayes text classification algorithm is proposed. Third, to address redundancy among features in text data, a fast feature selection method is introduced, mutual information theory is extended, and a mixed feature selection and weighting multinomial naive Bayes text classification algorithm is proposed. Finally, the practical application of the new algorithms to geological text classification is discussed.

The main research work is as follows:

(1) An improved distance correlation coefficient based feature weighted multinomial naive Bayes (IDCWMNB) algorithm is proposed for text classification. When setting feature weights, the algorithm starts from the distribution function relating features and categories, takes into account the heterogeneous characteristics of text data, improves the distance correlation coefficient by introducing inverse document frequency, and proposes a new weight measurement function. The improved function better describes the dependency between features and categories. The new algorithm is compared with the classical multinomial naive Bayes algorithm on a large number of standard text classification data sets.

(2) A class-dependent feature weighted multinomial naive Bayes (CDFWMNB) text classification algorithm is proposed. The algorithm introduces the idea of class dependence and assigns each feature a different weight in each category. The weight measure considers the distribution of the same feature across categories, the distribution of different features within the same category, and the overall distribution of features and categories. Compared with the traditional one-dimensional weight vector, the two-dimensional weight matrix generated by the new algorithm carries more comprehensive information and gives a more accurate description. The new algorithm is compared with the classical multinomial naive Bayes algorithm on a large number of standard text classification data sets.

(3) A mixed feature selection and weighting multinomial naive Bayes (MSWMNB) text classification algorithm is proposed. Redundant features are first filtered out by a fast feature selection method, and the selected features are then weighted using the improved mutual information theory. The first innovation is that redundancy between features is considered in the evaluation function during feature selection and weighting; to reduce the high computing cost caused by redundancy, the idea of fast feature selection is introduced into text classification. The second innovation is that, during feature weighting, mutual information is extended with term frequency information to give a new weight metric. The new algorithm is compared with the classical multinomial naive Bayes algorithm on a large number of standard text classification data sets.

(4) The application of the new algorithms to geological text classification is discussed. Classification experiments on engineering geological exploration text and mineral geological exploration text show that the proposed algorithms can find the required data in massive professional text collections promptly and accurately, mine the association knowledge contained in the data, and adapt to changing application environments and the needs of thematic retrieval, and that they perform better than the traditional multinomial naive Bayes algorithm. The results can provide knowledge support for geological engineering and for strategic research on geological and mineral resources.
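The general idea of feature weighting in multinomial naive Bayes can be illustrated with a minimal sketch. It is not the IDCWMNB, CDFWMNB, or MSWMNB algorithm: the abstract does not give their weight functions, so the weights here are plain placeholders supplied by the caller. The function names `train_mnb` and `predict`, the toy documents, and the use of Laplace smoothing are all assumptions made for illustration. Each class score is the log prior plus the sum over terms of weight × term frequency × log P(term | class), and the weights are stored per class, mirroring the two-dimensional weight matrix described in item (2).

```python
import math
from collections import Counter


def train_mnb(docs, labels, vocab):
    """Estimate log priors and Laplace-smoothed log P(term | class)."""
    class_counts = Counter(labels)
    term_counts = {c: Counter() for c in class_counts}
    for tokens, c in zip(docs, labels):
        term_counts[c].update(t for t in tokens if t in vocab)
    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_cond = {}
    for c, counts in term_counts.items():
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        log_cond[c] = {t: math.log((counts[t] + 1) / total) for t in vocab}
    return log_prior, log_cond


def predict(tokens, log_prior, log_cond, weights):
    """Score each class; weights[c][t] is the weight of term t for class c.

    A weight of 1.0 for every term reproduces standard multinomial
    naive Bayes. Terms outside the vocabulary are ignored.
    """
    freqs = Counter(tokens)
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t, f in freqs.items():
            if t in log_cond[c]:
                score += weights[c].get(t, 1.0) * f * log_cond[c][t]
        scores[c] = score
    return max(scores, key=scores.get), scores


# Toy data (hypothetical, for illustration only).
docs = [
    ["ore", "vein", "gold"],
    ["ore", "deposit", "copper"],
    ["borehole", "foundation", "soil"],
    ["soil", "slope", "foundation"],
]
labels = ["mineral", "mineral", "engineering", "engineering"]
vocab = {t for d in docs for t in d}

log_prior, log_cond = train_mnb(docs, labels, vocab)

# Placeholder weights: uniform, i.e. unweighted multinomial naive Bayes.
uniform = {c: {t: 1.0 for t in vocab} for c in log_prior}
label, scores = predict(["ore", "soil", "ore"], log_prior, log_cond, uniform)
print(label, scores)
```

Replacing `uniform` with weights computed from a relevance measure (for example a correlation or mutual information score per term and class) changes how strongly each term influences the decision, which is the lever the weighting schemes in the abstract act on.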
Keywords/Search Tags: multinomial naive Bayes, feature weighting, distance correlation coefficient, class dependence, text classification