Analysis Of Text Information Based On Deep Learning

Posted on: 2019-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Su
Full Text: PDF
GTID: 2438330566473383
Subject: Information and Communication Engineering
Abstract/Summary:
With the growth of massive, diverse and fragmented categories of information on the Internet, it is difficult to capture useful information rapidly and accurately. How to extract and represent text information for natural language processing is therefore an urgent issue. In addition, with the continuing development of Internet new media, how to classify original texts accurately and make recommendations by judging users' interests is also an urgent problem. This dissertation therefore studies word segmentation, text vectorization, multi-feature integration and classification. The work can help researchers carry out further application studies of deep learning networks in natural language processing, and can also provide technical services for Internet new media. The main work and results are summarized below.

A. The Maximum Matching (MM) and Hidden Markov (HM) approaches are compared and analyzed in terms of their application scopes, strengths and shortcomings. An improved word-segmentation method is then developed, which applies the word-tagging idea of the HM model within the MM model to evaluate segmentation quality. Comparative experiments show that the richness of the dictionary determines whether the MM method can segment words effectively; that the HM method is a low-efficiency approach; and that the improved method segments text correctly with high probability while also handling ambiguous words effectively.

B. For the problems of text vectorization and classification, an improved text-vectorization approach is first designed to reflect the feature information of texts. In this approach, a text vector is obtained by linearly weighting the feature vectors produced by the TextRank keyword extraction approach and word2vec, weighted according to the TF-IDF word-frequency feature vectors. An improved k-nearest neighbor algorithm based on multi-feature integration is then developed to carry out text classification. It uses an adaptive correction rule that sets the value of k from the max-class proportion in a small area and the densities of points. Comparative experiments indicate that the improved k-nearest neighbor algorithm is clearly superior to the compared approaches in both classification quality and efficiency.

C. Because conventional classification approaches cannot handle large-scale data classification, a text classification approach based on multi-feature fusion is proposed, which operates on linked feature vectors. Each linked feature vector is obtained by linking a feature vector from the feature vectorization approach with another feature vector from a convolutional neural network; the latter is computed from a feature matrix produced by the TextRank keyword extraction approach and word2vec. Comparative experiments show that the approach classifies text effectively and that its efficiency is superior to that of the conventional recurrent neural network.
Keywords/Search Tags:Keyword extraction, Text vectorization, Adaptive k-nearest neighbor classification algorithm, Convolutional neural network, Text classification
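
As an illustration of the baseline named in part A, the following is a minimal sketch of plain forward Maximum Matching, not the improved method developed in the thesis. The function name `forward_max_match`, the toy dictionary and the sample sentence are assumptions made for this example only. It shows why MM depends on dictionary richness and why it can mis-segment ambiguous strings.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy forward Maximum Matching over an unspaced string.

    `dictionary` is a set of known words. Characters that start no
    dictionary word are emitted as single-character tokens.
    """
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary and sentence, for illustration only.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dict))
# ['研究生', '命', '起源']
```

The greedy match takes "研究生" first and leaves "命" stranded, although "研究 / 生命 / 起源" is the intended reading. This is the kind of ambiguity that the abstract says the improved method is designed to handle.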