Research on Text Classification Using Semantic and Sequence Information

Posted on: 2019-06-29 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Q Nie | Full Text: PDF
GTID: 2348330563453946 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, large volumes of text have appeared in digital media such as blogs, microblogs, forums, and news websites. Using this information to analyze user behavior, recommend content, and provide services to users is an important and valuable research problem. Automatic text classification, as a basic text task, has therefore become an active research topic. Two of the more important directions are the design of good text representations and the construction of widely applicable models. Based on an in-depth study of text semantics and text sequence information, this thesis proposes a multi-granularity text learning method and a universal text representation model for texts of different lengths, and builds new sentence and document classification models on top of them.

On text semantic learning, the thesis starts from the principles of word vector learning and analyzes what the distributed information encoded in a word vector essentially represents: a comprehensive encoding of general language features, including semantics, grammar, pragmatics, and part of speech. This property of word vectors is independent of the learning model, the dataset, and the vector dimensionality. A further analysis of the relationship between word vector learning and text tasks, supported by text classification experiments, shows that word vector learning and text classification can be optimized jointly, which helps improve classification performance.

On text sequence information, the nature of this information is analyzed at two granularities, words and sentences: word-level order reflects syntax, grammar, and similar information, while sentence-level order reflects the writing logic of the article. Methods for learning sequence information are then studied from two aspects, global sequence information learning and local sequence information learning.

On text representation, a multi-granularity text learning method and a supervised universal text representation model are proposed. On text classification, two sentence classification models, LSTM-WSM and CNN-WSM, are built on the universal text representation model and achieve good results on the sentence classification task. Two bi-level document models are also proposed, based on learning sentence and document representations: the Independent Bi-Level Text Classification Model (IBLM) and the Independent Increasing Representation and Prediction Model (IIRPM).

On the Fudan news dataset, the document classification models IBLM and IIRPM reach test accuracies of 94.7% and 95.8%, respectively, of which 95.8% is the best result so far. On 20 Newsgroups, their test accuracies are 74% and 73.1%, respectively, of which 74% is the best result so far. The sentence classification models, however, do not exceed the best existing results on the sentence classification task. Future work will explore more models for learning text semantic and sequence information in order to achieve better results.
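The claim that word vector learning and text classification can be optimized together can be illustrated with a small sketch. This is not the thesis's LSTM-WSM, CNN-WSM, IBLM, or IIRPM model; it is a deliberately simple stand-in that assumes a document is represented by the average of its word vectors and classified with a softmax layer. The function name `train_joint`, the toy data, and all hyperparameters are hypothetical. The point it shows is only that the classification loss sends gradients back into the word vectors, so both are updated in the same training loop.

```python
import math
import random


def train_joint(docs, labels, vocab_size, n_classes,
                dim=8, lr=0.5, epochs=200, seed=0):
    """Jointly train word vectors E and a softmax classifier (W, b) with SGD.

    docs   : list of documents, each a non-empty list of word ids
    labels : list of class ids, one per document
    Returns (E, W, b, losses), where losses holds the mean loss per epoch.
    """
    rng = random.Random(seed)
    # Word vectors start as small random values; classifier starts at zero.
    E = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
         for _ in range(vocab_size)]
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    losses = []
    for _ in range(epochs):
        total = 0.0
        for doc, y in zip(docs, labels):
            n = len(doc)
            # Document representation: mean of its word vectors.
            h = [sum(E[w][k] for w in doc) / n for k in range(dim)]
            logits = [sum(W[c][k] * h[k] for k in range(dim)) + b[c]
                      for c in range(n_classes)]
            # Numerically stable softmax.
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            s = sum(exps)
            p = [e / s for e in exps]
            total += -math.log(p[y])
            # Gradient of cross-entropy with respect to the logits.
            dlogits = [p[c] - (1.0 if c == y else 0.0)
                       for c in range(n_classes)]
            # Gradient with respect to h, computed before W is changed.
            dh = [sum(W[c][k] * dlogits[c] for c in range(n_classes))
                  for k in range(dim)]
            # Update the classifier.
            for c in range(n_classes):
                for k in range(dim):
                    W[c][k] -= lr * dlogits[c] * h[k]
                b[c] -= lr * dlogits[c]
            # Update the word vectors with the same loss signal.
            for w in doc:
                for k in range(dim):
                    E[w][k] -= lr * dh[k] / n
        losses.append(total / len(docs))
    return E, W, b, losses


# Toy data (hypothetical): class 0 uses words 0-2, class 1 uses words 3-5.
docs = [[0, 1], [1, 2], [0, 2], [3, 4], [4, 5], [3, 5]]
labels = [0, 0, 0, 1, 1, 1]
E, W, b, losses = train_joint(docs, labels, vocab_size=6, n_classes=2)
print("first epoch loss:", losses[0], "last epoch loss:", losses[-1])
```

Freezing `E` (skipping the last update loop) would turn this into a classifier on fixed, pre-trained word vectors, which is the setting the joint approach is contrasted with.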
Keywords/Search Tags: text semantics, sequence information, text representation, multi-granularity text learning, text classification