
Research On Semantic Structure-Based Text Representation And Classification

Posted on: 2021-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: C F Hong
Full Text: PDF
GTID: 2518306308984679
Subject: Applied Mathematics
Abstract/Summary:
Most text data are unstructured. Effectively processing the text and accurately expressing its original information is the first prerequisite of the text classification task. The process of transforming unstructured data into structured data is generally called text representation. The existing vector space model suffers from high-dimensional sparsity and a lack of semantic information, so it is both necessary and meaningful to study in depth text representation models that carry semantic structure and class-domain information, and to verify their effectiveness in text classification tasks. The specific work is summarized as follows:

1. Text representation models based on word embedding combination usually take a weighted average of the word embeddings directly, which weakens the contribution of category information to classification performance. A representation model based on class semantic structure is therefore proposed. The model integrates text category information into the representation process, so that the resulting representation carries semantic information while preserving the category structure. The word embedding space is divided into class subspaces, and representative feature words are selected in each class subspace; the embeddings of the words matching these feature words are combined to obtain a class feature vector, and finally all class feature vectors are concatenated to form the vector representation of the text. The model is tested on both long-text and short-text classification, and the results show better classification performance than other weighted word embedding models.

2. To reduce the dimensionality of high-dimensional text representations in text classification, a linear regression classification method based on class-wise nearest neighbor dictionaries is proposed. Combining linear regression classification with the k-nearest neighbor method, the k nearest neighbors of the test sample within each class of training samples are selected to form a sub-neighborhood dictionary for that class. According to different representation-learning strategies, two models are proposed: CCND-LRC, which concatenates the class-wise sub-dictionaries into a single dictionary, and CND-LRC, which learns a representation of the test sample under each class's sub-dictionary separately. In addition, to alleviate the influence of noisy data on classification performance, a correlation measure is designed: noisy categories are pruned by measuring the correlation between the test sample and each category. Experimental results show that CND-LRC achieves better classification efficiency and performance than other sparse representation methods, especially on long text, and that CND-LRC with noise-category pruning is more effective as the number of categories grows.
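The class-semantic-structure representation of point 1 can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the abstract does not specify how representative feature words are selected (common choices would be chi-square or TF-IDF scoring), so the sketch simply takes the per-class feature-word sets as given and averages the embeddings of the document's words that fall in each set, concatenating one vector per class.

```python
import numpy as np

def text_representation(doc_tokens, class_feature_words, embeddings, dim):
    """Build a text vector by concatenating one feature vector per class.

    doc_tokens          -- list of words in the document
    class_feature_words -- one set of representative feature words per class
                           (selection criterion assumed given, e.g. chi-square)
    embeddings          -- dict mapping word -> embedding vector of length dim
    """
    parts = []
    for feats in class_feature_words:
        # embeddings of the document's words that are feature words of this class
        hits = [embeddings[w] for w in doc_tokens
                if w in feats and w in embeddings]
        if hits:
            parts.append(np.mean(hits, axis=0))  # combine into class feature vector
        else:
            parts.append(np.zeros(dim))          # class not represented in document
    # cascade (concatenate) all class feature vectors: length = n_classes * dim
    return np.concatenate(parts)
```

Because each class contributes its own sub-vector, the final representation keeps the category structure explicit instead of blending all words into a single averaged embedding.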
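The CND-LRC scheme of point 2 can be sketched along these lines, again as a hedged illustration rather than the thesis's exact method: Euclidean distance and ordinary least squares are assumed, and the noise-category pruning step and the CCND-LRC variant (which would concatenate the class-wise dictionaries into one) are omitted. For each class, the k training samples nearest to the test sample form a sub-dictionary; the test sample is regressed onto each sub-dictionary, and the class with the smallest reconstruction residual is predicted.

```python
import numpy as np

def cnd_lrc(X_train, y_train, x_test, k=5):
    """Class-wise nearest-neighbor linear regression classification (sketch).

    Assumes each class has at least k training samples. For every class,
    builds a dictionary from the k samples of that class closest to x_test,
    solves a least-squares regression of x_test on that dictionary, and
    returns the class whose dictionary reconstructs x_test best.
    """
    best_cls, best_res = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                    # samples of class c (rows)
        dists = np.linalg.norm(Xc - x_test, axis=1)   # distances to test sample
        D = Xc[np.argsort(dists)[:k]].T               # k nearest as dictionary columns
        beta, *_ = np.linalg.lstsq(D, x_test, rcond=None)
        res = np.linalg.norm(x_test - D @ beta)       # reconstruction residual
        if res < best_res:
            best_cls, best_res = c, res
    return best_cls
```

Restricting each regression to a k-sample neighborhood is what keeps the per-class dictionaries small, which is the source of the dimensionality and efficiency gains the abstract claims over whole-dictionary sparse representation methods.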
Keywords/Search Tags:Text representation model, Text classification, Word embedding, Semantic structure, Linear regression