
Text Representation Model And Feature Selection Algorithm

Posted on: 2018-03-11    Degree: Master    Type: Thesis
Country: China    Candidate: L Chen    Full Text: PDF
GTID: 2348330515497253    Subject: Control Science and Engineering
Abstract/Summary:
Text categorization, as an effective tool for handling unstructured information, has been widely used in machine learning and information retrieval. However, because text features are high-dimensional and highly sparse, the effectiveness and speed of text categorization depend heavily on the choice of feature selection method and text representation model. This thesis studies text feature selection and text representation models; the main work is as follows:

1) Traditional statistics-based feature selection methods do not take the semantics of features into account. This thesis proposes feature selection methods based on LDA word vectors and on Word2vec word vectors, which learn the semantic concept of a feature from topics and from word context, respectively. After feature selection, the vector space model is used to classify the corpus. Experimental results on the Fudan corpus show that the proposed word-vector-based feature selection methods improve classification quality compared with traditional feature selection. Moreover, word-vector-based feature selection is unsupervised and does not require labeled data sets.

2) The LDA (Latent Dirichlet Allocation) model does no preprocessing of the corpus's input features, which therefore contain many words that are meaningless to the topics and may degrade topic quality. For this reason, this thesis proposes a text feature selection method based on a genetic algorithm. It reduces the dimensionality of the original feature space, so that LDA can distribute its topics over a more meaningful feature space. Classification results on the Fudan corpus improve, and the proposed genetic algorithm does not require the proportion of selected features to be specified in advance.

Some of the topics generated by LDA are junk topics, and some are collections of unrelated feature words; meaningful topics are mainly identified by manual inspection. For automatic topic ranking, TSR (Topic Significance Ranking) is currently the only method used in this field, but its steps are cumbersome and it considers only the distance between a topic and the garbage topic. To rank topics by importance, this thesis proposes a method named maximum-garbage-topic-distance and minimum-similarity Topic Significance Ranking. Experimental results show that the proposed topic significance ranking method is simple and efficient and can identify meaningful topics.

3) LF-LDA (latent feature LDA) incorporates word vectors into model training, and its text categorization performance is better than that of LDA. Building on LF-LDA, this thesis proposes a text representation model based on LF-LDA and Word2vec, which represents a text by the distance between the topic vectors generated by LF-LDA and the document vector produced by Word2vec. In addition, a text representation model based on topic vectors is proposed, in which a document representation is obtained as a weighted composition of the topic vectors generated by LF-LDA. Experimental results on the StackOverflow short-text dataset show that the representation combining LF-LDA with Word2vec is superior to LF-LDA alone and to LDA combined with Word2vec, and that the categorization results of the topic-vector-based representation are similar to those of LF-LDA.
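To make the unsupervised word-vector feature selection of contribution 1) more concrete, the following is a minimal sketch assuming a clustering-based criterion: Word2vec vectors are trained on the tokenized corpus, words are clustered, and the words nearest each cluster centroid are kept as semantically representative features. The scoring rule, parameter values, and function names are illustrative assumptions; the abstract does not specify the thesis's exact selection criterion.

```python
# Hypothetical sketch: unsupervised feature selection driven by Word2vec vectors.
# Words are clustered by their embeddings and the words closest to each cluster
# centroid are kept as semantically representative features.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def select_features_word2vec(tokenized_docs, n_clusters=50, keep_per_cluster=20):
    # Train Word2vec on the tokenized corpus; the context window supplies semantics.
    w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5,
                   min_count=5, workers=4)
    vocab = list(w2v.wv.index_to_key)
    vectors = np.array([w2v.wv[w] for w in vocab])

    # Group words into semantic clusters.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)

    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Keep the words nearest the centroid as representatives of the cluster.
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        selected.extend(vocab[i] for i in members[np.argsort(dists)[:keep_per_cluster]])
    return selected
```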
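For the genetic-algorithm feature selection in contribution 2), the sketch below shows one conventional formulation: each chromosome is a binary mask over the vocabulary, and fitness is cross-validated classifier accuracy on the reduced document-term matrix. The fitness function, genetic operators, and parameters are assumptions made for illustration, not the thesis's exact design.

```python
# Hypothetical sketch: genetic-algorithm feature selection over a binary mask.
# X is a dense (n_docs, n_features) bag-of-words matrix, y the class labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def ga_feature_selection(X, y, pop_size=30, generations=40, p_mut=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_feats = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feats))       # random binary masks

    def fitness(mask):
        if mask.sum() == 0:                                   # empty mask is useless
            return 0.0
        return cross_val_score(MultinomialNB(), X[:, mask.astype(bool)], y, cv=3).mean()

    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Binary tournament selection.
        picks = [max(rng.choice(pop_size, 2, replace=False), key=lambda i: scores[i])
                 for _ in range(pop_size)]
        parents = pop[picks]
        # One-point crossover on consecutive parent pairs.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_feats)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # Bit-flip mutation.
        flips = rng.random(children.shape) < p_mut
        children[flips] = 1 - children[flips]
        pop = children

    best = max(pop, key=fitness)
    return best.astype(bool)                                  # mask of selected features
```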
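The topic significance ranking in contribution 2) is described as maximizing the distance to a garbage topic while minimizing similarity between topics. A minimal sketch of that idea scores each topic by its Jensen-Shannon distance to a uniform "garbage" word distribution minus its average cosine similarity to the other topics; this particular combination and the weights alpha and beta are assumptions, since the abstract does not give the exact formula.

```python
# Hypothetical sketch of a topic significance score in the spirit described:
# reward distance from a near-uniform "garbage" topic and penalize redundancy
# (similarity to the other topics).
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.metrics.pairwise import cosine_similarity

def rank_topics(topic_word, alpha=1.0, beta=1.0):
    """topic_word: (n_topics, vocab_size) array; each row is a word distribution."""
    n_topics, vocab = topic_word.shape
    garbage = np.full(vocab, 1.0 / vocab)                 # uniform "garbage" topic
    dist_to_garbage = np.array(
        [jensenshannon(t, garbage) for t in topic_word])  # larger = more focused topic

    sim = cosine_similarity(topic_word)
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.sum(axis=1) / (n_topics - 1)         # mean similarity to other topics

    score = alpha * dist_to_garbage - beta * redundancy
    return np.argsort(-score)                             # topic indices, most significant first
```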
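For contribution 3), the topic-vector-based representation can be illustrated as follows: a topic vector is a probability-weighted average of word embeddings for the topic's top words, and a document vector is a topic-proportion-weighted composition of topic vectors. Because LF-LDA is not sketched here, the code takes a generic topic-word matrix and gensim word vectors as inputs; all names and parameters are illustrative assumptions.

```python
# Hypothetical sketch of the topic-vector composition idea: topic vectors are
# built from word embeddings, and document vectors are weighted compositions
# of topic vectors. A generic topic-word matrix stands in for LF-LDA output.
import numpy as np

def topic_vectors(topic_word, vocab, w2v, top_n=50):
    """topic_word: (n_topics, vocab_size) word distributions; w2v: gensim KeyedVectors."""
    vecs = np.zeros((topic_word.shape[0], w2v.vector_size))
    for k, dist in enumerate(topic_word):
        top = np.argsort(-dist)[:top_n]                   # highest-probability words
        weights = dist[top] / dist[top].sum()
        for w_idx, weight in zip(top, weights):
            word = vocab[w_idx]
            if word in w2v:                               # skip out-of-vocabulary words
                vecs[k] += weight * w2v[word]
    return vecs

def document_vectors(doc_topic, topic_vecs):
    """doc_topic: (n_docs, n_topics) topic proportions per document."""
    return doc_topic @ topic_vecs                         # weighted composition of topic vectors
```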
Keywords/Search Tags: Text categorization, feature selection, genetic algorithm, LDA, word vector, Word2vec, LF-LDA