Research On Classification Method On Chinese Short Texts With Few Words Based On Feature Representation

Posted on:2021-03-15

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Yue

GTID:2428330614960455

Subject:Computer technology

Abstract/Summary:

Short text classification has made great progress in recent years. However, most existing methods focus only on data such as Twitter or Weibo posts, where the text length is typically no more than 300 characters. In real-world applications such as classifying news titles and invoice names, the texts are extremely short, have sparse features, and are semantically ambiguous, which makes it difficult for existing short text classification methods to achieve ideal results. To address these problems, this dissertation studies the classification of short texts with few words, such as news headlines and invoice names, from the perspective of feature representation learning. The main work is as follows:

(1) To address the extremely short length and sparse features of invoice name text, an extremely short Chinese text classification method based on bidirectional semantic extension (BSE-ESTC) is proposed. First, to alleviate the feature sparsity caused by the extremely short text, bidirectional synonym queries are conducted for the words in the text and for the class label attribute words, and the resulting synonyms are added to the word segmentation results, which semantically expands the extremely short text. Second, to avoid the semantic ambiguity caused by sparsity, a hash vectorization method is used to vectorize the word segmentation results, which are then classified. Experimental results show that the method has excellent performance on tax invoice data.

(2) To address the highly ambiguous semantics of Chinese short texts with few words, a hybrid classification method (AFC) based on character embedding is proposed, which combines the attention mechanism with the selection of character embedding features. The method first uses Chinese character embeddings to vectorize the text. It then superimposes the attention mechanism on the feature representation learning model to assign a different weight to each word, with larger weights for words that are useful for classification and smaller weights otherwise, which aims to improve keyword recognition and, in turn, classification accuracy. Next, the semantic similarity between the content and the class label is calculated from the word vector representation to avoid semantic ambiguity. Finally, feature selection based on the weight of each word removes meaningless, disturbing features and improves the quality of the feature vectors, which aims to improve classification accuracy.
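
The two steps of the BSE-ESTC pipeline described in (1), synonym-based expansion followed by hash vectorization, can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names `expand_tokens` and `hash_vectorize`, the toy synonym table, the English tokens, and the reading of "bidirectional" (text word to synonym, and class-label word back to the text) are all assumptions made for illustration. The abstract does not specify the synonym source, the hash function, or the classifier.

```python
import hashlib


def expand_tokens(tokens, label_words, synonyms):
    """Expand a segmented short text with synonyms (illustrative only).

    Forward direction: add the synonyms of each word in the text.
    Backward direction (assumed reading of "bidirectional"): add a
    class-label word when one of its synonyms occurs in the text.
    """
    expanded = list(tokens)
    seen = set(tokens)

    def add(word):
        if word not in seen:
            seen.add(word)
            expanded.append(word)

    for tok in tokens:
        for syn in synonyms.get(tok, []):
            add(syn)

    token_set = set(tokens)
    for label in label_words:
        if token_set & set(synonyms.get(label, [])):
            add(label)
    return expanded


def hash_vectorize(tokens, n_features=16):
    """Map tokens to a fixed-length vector with the hashing trick.

    MD5 is used only to get a hash that is stable across runs; Python's
    built-in hash() is randomized per process for strings.
    """
    vec = [0.0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        # A signed update reduces the bias introduced by hash collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


if __name__ == "__main__":
    # Hypothetical synonym table and class labels.
    toy_synonyms = {
        "laptop": ["notebook"],
        "electronics": ["laptop", "computer"],
    }
    words = expand_tokens(["laptop", "bag"], ["electronics", "food"], toy_synonyms)
    print(words)
    print(hash_vectorize(words))
```

The resulting fixed-length vectors could then be passed to any standard classifier. Hashing keeps the feature dimension constant regardless of how many synonyms the expansion step adds, at the cost of occasional collisions.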

Keywords/Search Tags:

Short text classification, Short text with few words, Feature representation, Attention mechanism

Related items

1	Design And Implementation Of Chinese Short Text Classification Method
2	Research On Short Text Classification
3	Research On Chinese Short Text Representation And Classification
4	Research And Application Of Chinese Short Text Classification Algorithm Based On Deep Learning
5	Feature Weight Optimization For Short-Text Multiclass Classification
6	Research On Short Text Classification Method Based On Text Graph Structure
7	Attention-based Short Text Classification
8	Research On Short Text Classification Based Upon Convolution Feature Encoding And Attention Mechanism
9	Research On Short Text Classification Method Based On Feature Extension
10	Research On Short Text Classification Of Chinese News Based On Machine Learning