
Research On Semantic Feature Based Text Classification Algorithm

Posted on: 2017-12-15    Degree: Master    Type: Thesis
Country: China    Candidate: B Yuan    Full Text: PDF
GTID: 2348330518495545    Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of information technology and the Internet, our society has been influenced deeply and extensively. Massive volumes of electronic text are created and spread through the Internet as a result of the explosion of websites, social networking services (SNS), and e-commerce. Text is one of the most important media on the Internet, so it is both necessary and valuable to study techniques for filtering, organizing, managing, and mining web text. As a frontier topic in the field of information processing, automatic text classification, or automatic text categorization (TC), enables us to manage massive text collections effectively and to locate the information we are interested in quickly. TC techniques have been widely applied in information retrieval (IR), news classification, e-mail classification, and public opinion analysis. TC therefore has bright application prospects and high research significance.

The vector space model (VSM) and topic models are the most popular methods for modeling text. Both are bag-of-words models, which ignore word order and word context. However, meaning often changes when word order differs, and words take different meanings in different contexts. Since the class label of a text depends on its semantic meaning, the semantic information ignored by these two models is important for text classification. To overcome this weakness of the VSM and topic models, this thesis applies deep learning techniques to mine the semantic information within text. One advantage of deep learning is that it can learn abstract (semantic) features through deep architectures. The deep learning techniques used in this thesis include word embedding, recurrent neural networks, and convolutional neural networks.

The main contributions of this thesis are as follows. First, this thesis proposes a negative-sample-based recurrent neural network language model (Neg-RNNLM) to train word embeddings. After a detailed analysis of the problems of current word embedding methods, this thesis improves the recurrent neural network language model (RNNLM): Neg-RNNLM is more efficient than RNNLM, and the quality of its word embeddings is better. Second, a combined text and knowledge base model is proposed to train word embeddings. Knowledge bases such as WordNet contain many useful and accurate semantic relations, and the combined model exploits the semantic relations in WordNet to obtain more accurate word embeddings. Third, this thesis compares three different methods for modeling document features based on word embeddings: Paragraph Vector, CNN, and the RNN recurrent layer vector. The RNN recurrent layer vector method has not been studied in previous work. Experimental results show that CNN performs best; combining CNN with word embeddings trained by the Neg-RNNLM-graph model achieves state-of-the-art text classification results on two benchmark datasets.
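The abstract does not spell out the exact Neg-RNNLM objective, but negative sampling generally replaces a full softmax over the vocabulary with a binary discrimination between the true context word and a few sampled noise words. As an illustration only, here is a minimal pure-Python sketch of a standard negative-sampling loss; the function name and the toy vectors are hypothetical, not from the thesis.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss(target_vec, context_vec, negative_vecs):
    """Negative-sampling objective: push the target embedding toward the
    true context word and away from k sampled noise words, so no full
    softmax over the vocabulary is needed."""
    dot = sum(t * c for t, c in zip(target_vec, context_vec))
    loss = -math.log(sigmoid(dot))          # reward similarity to the true context
    for neg in negative_vecs:
        dot_neg = sum(t * n for t, n in zip(target_vec, neg))
        loss -= math.log(sigmoid(-dot_neg))  # penalize similarity to noise words
    return loss
```

With one noise word, a target aligned with its context yields a lower loss than a misaligned one, which is the pressure that shapes the embeddings during training.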
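The CNN document-feature method the abstract favors typically slides filters over the word-embedding sequence and keeps the maximum response per filter (max-over-time pooling), giving a fixed-length document vector regardless of text length. The sketch below is a toy single-filter version in pure Python; the filter weights and embeddings are made up for illustration and do not come from the thesis.

```python
def conv_max_pool(embeddings, filt):
    """Slide one window filter over a sequence of word embeddings and
    return the max response over all positions (max-over-time pooling).
    `filt` is a list of `width` weight vectors, one per window slot."""
    width = len(filt)        # filter width, in words
    dim = len(filt[0])       # embedding dimension
    scores = []
    for i in range(len(embeddings) - width + 1):
        s = sum(filt[j][d] * embeddings[i + j][d]
                for j in range(width) for d in range(dim))
        scores.append(s)
    return max(scores)       # one scalar feature per filter
```

In a real classifier many such filters of several widths are applied, and their pooled maxima are concatenated into the document feature vector fed to the output layer.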
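The "RNN recurrent layer vector" method is not detailed in the abstract; one common reading is that the final hidden state of a recurrent layer run over the word sequence serves as the document vector. The following is a minimal Elman-style sketch under that assumption, with hypothetical weight matrices; it is not the thesis's actual architecture.

```python
import math

def rnn_document_vector(embeddings, W_in, W_rec):
    """Run a simple recurrent layer over the word embeddings and return
    the final hidden state as a fixed-length document feature vector.
    W_in: hidden_dim x input_dim, W_rec: hidden_dim x hidden_dim."""
    h = [0.0] * len(W_rec)                    # initial hidden state
    for x in embeddings:                      # one step per word
        h = [math.tanh(sum(W_in[i][d] * x[d] for d in range(len(x)))
                       + sum(W_rec[i][j] * h[j] for j in range(len(h))))
             for i in range(len(h))]
    return h
```

Unlike max-over-time pooling, this summary is order-sensitive, which is exactly the word-order information the bag-of-words models discussed above discard.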
Keywords/Search Tags:text classification, semantic feature, word embedding, recurrent neural network, convolutional neural network