The Method Of Selecting Local Feature Words And Its Application In Text Classification

Posted on:2020-03-02

Degree:Master

Type:Thesis

Country:China

Candidate:B Liu

Full Text:PDF

GTID:2417330578973088

Subject:Applied Statistics

Abstract/Summary:

In a highly information-driven society, information technology and Internet technology are updated and adopted ever faster, and the text resources stored in electronic databases are becoming increasingly complex. Building on people's basic understanding of information processing, automatic text classification has become a key technology for handling large-scale, constantly updated text data. To acquire and use valuable information and knowledge from large volumes of complex information efficiently and with high quality, users place higher demands on text classification, such as shorter computing time and higher classification accuracy. Besides improving the classifier itself, feature selection can also improve classification efficiency and accuracy by reducing the dimensionality of the data and removing noise.

This thesis studies and analyzes the existing classical feature selection methods. For the text classification of digital literature, it proposes a new feature selection method that operates on the local structure defined by document categories. The method jointly considers the correlation between keywords and document categories and the co-occurrence intensity between different keywords, and evaluates the importance of keywords from two perspectives: “correlation-categories-keywords” and “co-occurrence-keywords-keywords”. First, the contribution of each keyword to classification is quantified with a random forest on the global structure. Second, the original keyword set is divided by the mutual information method into keyword subsets, one per category. Third, the degree of association between different keywords is evaluated by co-occurrence analysis. Finally, the literature categories are combined in pairs, and the co-occurrence information of keywords in the local structure is extracted from the keyword co-occurrence intensity matrix. Keyword pairs from the same category and keyword pairs from different categories are treated as two separate cases; in each case, when the co-occurrence intensity of a pair of words exceeds the threshold, the keyword with the lower contribution to classification is eliminated. The keyword subsets corresponding to the categories are then merged into a global keyword subset, which serves as the feature variables of the vector space and yields a new text representation of the data.

In the experiments, the local feature selection method is used to reduce the dimensionality of a class-balanced self-collected data set and of class-imbalanced public data sets, and the data sets are classified both before and after feature selection. A comparison of the experimental results shows that the local feature selection method achieves better classification results in text classification.
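
The selection steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, and several details are assumptions made only for illustration: pointwise mutual information is used to assign each keyword to a category, the Jaccard index stands in for the co-occurrence intensity, the threshold of 0.6 is arbitrary, and the `contribution` scores are hypothetical values standing in for random forest importances. All function names and the toy data are likewise hypothetical.

```python
import math
from itertools import combinations_with_replacement


def pmi(docs, labels, word, category):
    """Pointwise mutual information between a keyword and a category."""
    n = len(docs)
    n_w = sum(1 for d in docs if word in d)
    n_c = sum(1 for y in labels if y == category)
    n_wc = sum(1 for d, y in zip(docs, labels) if word in d and y == category)
    if n_wc == 0:
        return float("-inf")
    return math.log(n_wc * n / (n_w * n_c))


def assign_to_categories(docs, labels, vocabulary):
    """Split the keyword set into one subset per category (highest PMI wins)."""
    categories = sorted(set(labels))
    subsets = {c: set() for c in categories}
    for w in vocabulary:
        best = max(categories, key=lambda c: pmi(docs, labels, w, c))
        subsets[best].add(w)
    return subsets


def jaccard(docs, a, b):
    """Assumed co-occurrence intensity: shared documents over documents with either word."""
    both = sum(1 for d in docs if a in d and b in d)
    either = sum(1 for d in docs if a in d or b in d)
    return both / either if either else 0.0


def prune(docs, labels, subsets, contribution, threshold):
    """Greedily drop the lower-contribution keyword of each strongly co-occurring pair."""
    removed = set()
    # Category pairs include (c, c), covering the same-category case.
    for c1, c2 in combinations_with_replacement(sorted(subsets), 2):
        # Local structure: only documents belonging to the two categories.
        local_docs = [d for d, y in zip(docs, labels) if y in (c1, c2)]
        for a in sorted(subsets[c1]):
            for b in sorted(subsets[c2]):
                if a == b or a in removed or b in removed:
                    continue
                if jaccard(local_docs, a, b) > threshold:
                    removed.add(a if contribution[a] < contribution[b] else b)
    # Merge the per-category subsets into the global keyword subset.
    return set().union(*subsets.values()) - removed


# Toy data: each document is represented by its set of keywords.
docs = [
    {"forest", "tree", "bagging"},
    {"forest", "tree", "entropy"},
    {"tree", "bagging"},
    {"corpus", "token", "entropy"},
    {"corpus", "token"},
    {"corpus", "parsing"},
]
labels = ["ml", "ml", "ml", "nlp", "nlp", "nlp"]

# Hypothetical scores standing in for random forest feature importances.
contribution = {
    "forest": 0.30, "tree": 0.25, "bagging": 0.10, "entropy": 0.05,
    "corpus": 0.20, "token": 0.08, "parsing": 0.02,
}

vocabulary = sorted(set().union(*docs))
subsets = assign_to_categories(docs, labels, vocabulary)
selected = prune(docs, labels, subsets, contribution, threshold=0.6)
print(sorted(selected))
```

Because the pruning in this sketch is greedy, the surviving keywords can depend on the order in which pairs are visited; sorting the categories and keywords keeps the result reproducible.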

Keywords/Search Tags:

Local feature selection, Co-word analysis, Random forest, Mutual information, Text classification

Related items

1	Chinese Text Categorization Method And Implementation
2	Research On High Dimensional Imbalanced Data Classification Based On Random Forest
3	EEG Signal Classification Based On Iterative Random Forest Algorithm
4	Research On Imbalanced News Text Mining Based On Improved Random Forest
5	Several Classification Algorithms And Their Applications In Statistical Learning
6	Research On The Evaluation Method Of Sports Effects Based On Feature Selection
7	Feature Extraction Of Video Titles On B Station And Analysis Of Factors Affecting Their Popularit
8	High-dimensional Data Based On MIC Feature Selection And Application Research
9	Research On Classification Of Imbalanced Datasets Based On Random Forest
10	Dimensionality Reduction Based On Feature Selection