
Research On Semantic Similarity Calculation Method And Data Augmentation In Chinese Short Text

Posted on: 2022-08-13    Degree: Master    Type: Thesis
Country: China    Candidate: W Li    Full Text: PDF
GTID: 2518306554450484    Subject: Software engineering
Abstract/Summary:
Short-text semantic similarity calculation is a key technology in natural language processing and is widely used in intelligent customer-service question answering, natural language inference, text information retrieval, and automatic scoring. However, research on semantic similarity calculation for Chinese short texts still faces several challenges. (1) Chinese short texts are brief and carry limited effective information, which makes their text features sparse. Traditional models built on a Siamese network have limited ability to extract features from such texts; specifically, they struggle to capture both the associations between different words within a text and the interaction between the two texts at the same time. (2) Deep learning improves the model's semantic similarity calculation, but such models usually have complex structures, and their quality depends on the scale and quality of the training data, especially labeled data. In real-world settings, acquiring labeled data is time-consuming and labor-intensive, so datasets are often not large enough and label categories are unevenly distributed, which results in poor model performance. To address these problems, this thesis proposes the following solutions.

(1) For problem 1, a model named SRoberta-SelfAtt is proposed that fuses a Siamese network with the RoBERTa pre-trained model. First, within the Siamese architecture, the original text pair is encoded into word-level vectors by the RoBERTa pre-trained model, and a self-attention mechanism captures the associations between different words in each text. Then, global max pooling and global average pooling are each applied to obtain sentence vectors for the text pair, which are concatenated; the resulting representations are then interacted and merged. Finally, softmax is applied in the fully connected layer to compute the loss and evaluate the semantic similarity of the text pair. The model was evaluated on three datasets (AFQMC, LCQMC, and OCNLI) covering two task types (intelligent question matching and natural language inference). The F1 scores on the test sets of the first two datasets reached 80.05% and 84.5%, respectively, and the average accuracy on the validation set of the third reached 76.1%. These results improve on those of the other models compared, providing an effective basis for further improving the accuracy of text semantic similarity calculation.

(2) For problem 2, a hybrid text augmentation method is proposed that combines the simple text augmentation method EDA with the text generation model LaserTagger. First, the original text is augmented with the random exchange strategy from EDA. Then, the text pair formed by the EDA-augmented text and the original text is fed to the LaserTagger model to obtain a paraphrase of the input pair; this paraphrase is the final augmented text produced by the hybrid method. The method was applied to the AFQMC dataset, augmenting the samples labeled 1. The augmented and original texts were combined as the training set for the SRoberta-SelfAtt model, and the F1 score reached 86.71%. Compared with training on the original unaugmented data, with augmentation based on back translation, and with augmentation based on EDA random exchange alone, this is an increase of 8.3, 0.9, and 3.5 percentage points, respectively. The experimental results show that the proposed method augments data effectively and thereby improves model accuracy to some extent.
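
The random exchange step of the augmentation method (called random swap in EDA) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `random_swap`, the parameter `n_swaps`, and the example question are hypothetical, and the tokens are assumed to come from a Chinese word segmenter that is not shown. The sketch only builds the (augmented, original) text pair that the abstract says is passed to LaserTagger; it does not call LaserTagger.

```python
import random


def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of `tokens` with `n_swaps` random pairs of positions exchanged.

    Hypothetical helper illustrating EDA-style random swap; the number of
    swaps used in the thesis is not stated, so `n_swaps` is an assumption.
    """
    if rng is None:
        rng = random.Random()
    augmented = list(tokens)  # copy so the caller's list is left unchanged
    if len(augmented) < 2:
        return augmented  # fewer than two tokens: nothing to exchange
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange the tokens at them.
        i, j = rng.sample(range(len(augmented)), 2)
        augmented[i], augmented[j] = augmented[j], augmented[i]
    return augmented


# Hypothetical pre-segmented question (the segmentation itself is assumed).
tokens = ["花呗", "如何", "提前", "还款"]
augmented = random_swap(tokens, n_swaps=1, rng=random.Random(0))

# The (augmented, original) pair is what would be fed to the paraphrase model.
pair = ("".join(augmented), "".join(tokens))
print(pair)
```

Swapping operates on segmented words rather than individual characters so that each exchanged unit stays a meaningful word; passing a seeded `random.Random` makes the augmentation reproducible.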
Keywords/Search Tags:Short text, Semantic similarity, Roberta model, Siamese network, Data augmentation