Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification

Posted on: 2022-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z K Zhang
GTID: 2518306575968489
Subject: Electronics and Communications Engineering
Abstract/Summary:
Internet technology is developing rapidly, and society is becoming steadily more information-driven. With the explosive growth of information, people read large volumes of news text online every day, especially on mobile phones. News spreads widely and has considerable social impact, so analyzing and processing news text is particularly important. Given the sheer volume of news text, automatically classifying it, filtering out low-value or sensitive content, and recommending news of the same or related types to readers has become an active research area. Automatic text classification is a key technology for news information processing: it organizes information effectively and quickly distinguishes categories of information according to users' needs, which matters both for maintaining a clean network environment and for processing massive amounts of text. This thesis aims to improve the efficiency and performance of text classification by addressing two of its component techniques, feature selection and text representation. The main research contents are as follows:

1. To address the computational difficulty caused by high-dimensional features in text classification, this thesis proposes CICHI, an improved chi-square feature selection method based on the traditional CHI algorithm. CHI feature selection does not consider the word frequency of feature items, and it exaggerates the weights of feature items that rarely occur in the target category but often appear in non-target categories. The improved method draws on word frequency information both within classes and between classes, so that word frequency supplements document frequency. In addition, the identification of negatively correlated features is improved, and interfering features are eliminated with the help of information entropy, so that the selected feature items appear more frequently in the target category than in the non-target categories. Experimental results show that CICHI outperforms traditional feature selection algorithms and selects high-quality feature subsets. While maintaining classification accuracy, it reduces the size of the feature subset, thereby improving the efficiency of text classification.

2. To address sparse text features and unsatisfactory feature extraction, this thesis proposes a text representation model that combines topic vectors and word vectors, building on the word vector model and the topic model. The abstract information of the text is expressed at multiple granularities, at the word level and at the full-text level, yielding a feature matrix that reflects contextual semantics fused with topic features. A convolutional neural network then strengthens the relationships among words and among full texts, and performs text classification with high discrimination. The word vector model refines features to the word level and trains a separate vector for each word. The semantic information it yields comes entirely from directly superimposing word vectors, without considering the text as a whole, and the semantic association between words weakens the differences between them. The topic model, by contrast, establishes the topic distribution of the text, which makes it possible to capture overall semantics at the text level. Here, related topics are obtained through topic distribution weight vectors and combined with word vector features. Experimental results show that the proposed model improves text representation, achieves better accuracy, and improves text classification performance.
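As an illustration of the traditional CHI baseline that item 1 starts from, the following is a minimal sketch of the standard document-frequency chi-square score for one term and one target category. It is not the CICHI method, whose formula is not given in this abstract. The function names `chi_square` and `chi_scores` and the toy corpus are assumptions made purely for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square score from a 2x2 document-count table.

    a: documents in the target class that contain the term
    b: documents outside the target class that contain the term
    c: documents in the target class that lack the term
    d: documents outside the target class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi_scores(docs, labels, target):
    """Score every term against one target class (hypothetical helper).

    docs is a list of token lists; labels is a parallel list of class names.
    Only document frequency is used: repeated tokens in a document count once.
    """
    in_class = sum(1 for y in labels if y == target)
    out_class = len(labels) - in_class
    df_in, df_out = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        counter = df_in if y == target else df_out
        counter.update(set(tokens))
    scores = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        scores[term] = chi_square(a, b, in_class - a, out_class - b)
    return scores


# Toy corpus (made up for this sketch): two sports and two finance documents.
docs = [
    ["goal", "match"],
    ["goal", "team"],
    ["stock", "market"],
    ["stock", "bank"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = chi_scores(docs, labels, "sports")
print(scores["goal"])   # 4.0
print(scores["stock"])  # 4.0
```

In this toy corpus, "stock" never occurs in a sports document, yet it receives the same score as "goal" for the sports category. The statistic squares the term (AD - BC), so it does not distinguish positive from negative correlation, and it counts only documents, not how often a word occurs in them. These are the two weaknesses of CHI that the abstract describes.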
Keywords/Search Tags: text classification, feature selection, text representation, convolutional neural network