
Research On Classification Method On Chinese Short Texts With Few Words Based On Feature Representation

Posted on: 2021-03-15; Degree: Master; Type: Thesis
Country: China; Candidate: Y Z Yue; Full Text: PDF
GTID: 2428330614960455; Subject: Computer technology
Abstract/Summary:
Short text classification has made great progress in recent years. However, most existing methods focus only on data such as Twitter or Weibo posts, where the text length is typically no more than 300 characters. In real-world applications such as classifying news titles and invoice names, the texts are extremely short, their features are sparse, and their semantics are ambiguous, which makes it difficult for existing short text classification methods to achieve ideal results. To address these problems from the perspective of feature representation learning, this dissertation focuses on the classification of short texts with few words, such as news headlines and invoice names. The main work is as follows:

(1) To address the extremely short length and sparse features of invoice name text, an extremely short Chinese text classification method based on bidirectional semantic extension (BSE-ESTC) is proposed. First, to alleviate the feature sparsity caused by the extremely short text, a bidirectional synonym query is conducted for the words in the text and the class label attribute words, and the resulting synonyms are added to the word segmentation results, which semantically expands the extremely short text. Second, to avoid the semantic ambiguity caused by sparsity, a hash vectorization method is used to vectorize the word segmentation results, which are then classified. Experimental results show that the method performs well on tax invoice data.

(2) To address the highly ambiguous semantics of Chinese short texts with few words, a hybrid classification method (AFC) based on character embedding is proposed, which combines the attention mechanism with feature selection over character embeddings. The method first uses Chinese character embedding to vectorize the text. It then superimposes the attention mechanism on the feature representation learning model to assign a different weight to each word, so that words that are more useful for classification receive larger weights and less useful words receive smaller ones; this improves keyword recognition and, in turn, classification accuracy. Next, it calculates the semantic similarity between the content and the class label from the word vector representation to avoid semantic ambiguity. Finally, it performs feature selection according to the weight of each word to remove meaningless, disturbing features and improve the quality of the feature vectors, which further improves classification accuracy.
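The two steps of method (1), synonym-based expansion followed by hash vectorization, can be illustrated with a minimal Python sketch. This is only one reading of the abstract, not the thesis implementation: the function names `expand_tokens` and `hash_vectorize`, the toy synonym table, the vector size of 16, and the use of MD5 as the hash function are all assumptions made for illustration. In particular, "bidirectional" is interpreted here as querying synonyms of the text's words and also checking whether a text word is a synonym of a class label attribute word.

```python
import hashlib


def expand_tokens(tokens, label_words, synonyms):
    """Expand a segmented short text with synonyms (hypothetical helper).

    Forward direction: add the synonyms of each token.
    Reverse direction: add a class label word when one of the tokens
    appears among that label word's synonyms.
    """
    expanded = list(tokens)

    def add(word):
        # Keep first occurrence only, preserving order.
        if word not in expanded:
            expanded.append(word)

    for tok in tokens:
        for syn in synonyms.get(tok, []):
            add(syn)
    for label in label_words:
        if any(tok in synonyms.get(label, []) for tok in tokens):
            add(label)
    return expanded


def hash_vectorize(tokens, n_features=16):
    """Map tokens to a fixed-length count vector with the hashing trick."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_features
        vec[index] += 1
    return vec


# Toy synonym table (assumed; the abstract does not name a synonym source).
toy_synonyms = {
    "laptop": ["notebook"],
    "electronics": ["laptop", "phone"],
}
expanded = expand_tokens(["laptop"], ["electronics"], toy_synonyms)
vector = hash_vectorize(expanded)
```

The resulting fixed-length vector could then be passed to any standard classifier. A hash-based representation needs no stored vocabulary, which suits very short texts whose expanded word set is open-ended, at the cost of possible index collisions when `n_features` is small.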
Keywords/Search Tags: Short text classification, Short text with few words, Feature representation, Attention mechanism