Short Text Classification Method Combining Statistical Information And Conceptual Information Of Knowledge Base

Posted on: 2021-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
GTID: 2518306122468744
Subject: Computer technology
Abstract/Summary:
Short text classification is a fundamental research topic in natural language processing and plays an important role in recommendation systems, question answering systems, and sentiment analysis. In the era of networked information, short texts are an indispensable carrier for daily communication and information sharing. However, short texts are brief, syntactically irregular, semantically sparse, and lacking in contextual knowledge. To address these problems, researchers have used statistical information to enhance the feature representation of data sets for classification. For example, the TF-IDF weighting algorithm strengthens feature terms that help classification and weakens those that do not. However, the statistical information of a single small data set cannot effectively describe the importance of feature terms; a promising alternative is to use the statistical information of a large-scale knowledge base (e.g., the Wikipedia knowledge base or the Google knowledge base) to enhance semantic features. Furthermore, most previous studies have focused on improving the word embedding model and the classification model, while overlooking the limited expressiveness and sparse semantics of short text data sets and the ambiguity of words themselves. If prior knowledge can be obtained from a knowledge base outside the data set to improve the expressiveness of the data set, short text classification should become more effective. Based on these observations, this thesis uses the statistical information of the Wikipedia knowledge base to propose two feature weighting schemes that characterize the importance of feature terms and enhance the ability of each sample to express semantics. In addition, this thesis uses an existing knowledge base to obtain conceptual knowledge related to short texts, which alleviates their lack of background knowledge. Specifically, the main work of this thesis is as follows:
(1) Based on the idea that the statistical knowledge in a large-scale knowledge base can effectively describe the importance of words, word frequencies in the Wikipedia knowledge base are counted to obtain the statistical knowledge of that knowledge base.
(2) Based on the statistical knowledge obtained in (1), two feature weighting schemes are proposed, and experiments show that both weighting schemes are effective.
(3) With the help of the Probase knowledge base, the concepts related to the words of a short text are retrieved to enrich the representation of words, reduce word ambiguity, and, to a certain extent, compensate for the lack of background knowledge.
(4) Combining the statistical knowledge of the Wikipedia knowledge base with the conceptual knowledge of the Probase knowledge base, the CAE-CNN model is proposed on the basis of the convolutional neural network (CNN) model, and experimental results show that the model is effective.
(5) Based on deep learning methods, six variants of the CAE-CNN model are proposed. Comparing and analyzing the experimental results of these six variants against the CAE-CNN method shows that the variants are competitive in improving short text classification.
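The abstract does not state the formulas of the two proposed weighting schemes, so the following is only an illustrative sketch of the general idea, not the thesis's method. It contrasts classic TF-IDF, computed from the data set alone, with a hypothetical weight derived from word counts in an external knowledge base. The function names, the smoothed log-ratio formula, and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF computed only from the (small) data set itself."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def external_weight(term, kb_counts, kb_total):
    """Hypothetical importance score from knowledge-base word counts.

    Assumption: terms that are rarer in the large corpus are more
    informative. Add-one smoothing keeps unseen terms finite.
    """
    return math.log((kb_total + 1) / (kb_counts.get(term, 0) + 1))


def kb_weighted(doc, kb_counts, kb_total):
    """Term frequency in the short text scaled by the external weight."""
    tf = Counter(doc)
    return {
        t: (c / len(doc)) * external_weight(t, kb_counts, kb_total)
        for t, c in tf.items()
    }


if __name__ == "__main__":
    docs = [["apple", "stock", "price"], ["apple", "pie", "recipe"]]
    # Toy counts standing in for word frequencies from a large corpus.
    kb_counts = {"the": 1000, "apple": 50, "recipe": 5}
    print(tf_idf(docs))
    print(kb_weighted(["the", "apple", "recipe"], kb_counts, 10000))
```

With only two documents, TF-IDF assigns "apple" a weight of zero because it occurs in both, which illustrates how statistics from a small data set can misjudge a term. The external weight does not depend on the size of the data set being classified.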
Keywords/Search Tags:short text classification, knowledge base, weighting scheme, statistical knowledge, concept knowledge, convolutional neural network, convolution kernel