
Research And Application Of Similarity Calculation In Mixed Long And Short Texts

Posted on: 2022-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: C L Xu
Full Text: PDF
GTID: 2518306764467554
Subject: Computer Software and Application of Computer
Abstract/Summary:
Textual similarity calculation is one of the most important tasks in natural language processing. The growth of social media has increased the number of short texts, so long and short texts are now frequently mixed, and calculating the similarity between them has become an urgent problem. Most existing studies focus on texts of similar length and fall into three types of models: representation structures, interaction structures and pre-training structures. Representation and interaction structures apply the same feature extractor to both texts, and therefore cannot capture the differences in features between long and short texts. Pre-training structures lack interactive features. In addition, most existing search systems rely on matching segmented words without considering semantic relevance. To solve these problems, this thesis focuses on the following three aspects.

(1) A model is designed that calculates the similarity of mixed long and short texts using a pseudo-siamese network. The method uses two different feature extractors, one for long texts and one for short texts. The feature extractor for long texts is Longformer, which avoids the information loss caused by splitting long texts and reduces the computational cost of the attention mechanism. The feature extractor for short texts fuses dual-channel features from BiLSTM and ABCNN. The method overcomes the differences in temporal features and in feature quantity between long and short texts, and improves the accuracy of textual similarity calculation.

(2) A pre-training model with interactive features is designed to calculate the similarity between mixed long and short texts. The method combines the advantages of the interaction structure and the pre-training structure. It uses Transformer-XL to address the long-range dependency problem of long texts. A permutation language model is used to represent the texts and extract interactive features at the same time. A GRU layer is added to learn deeper text features, and a residual network is added to avoid network degradation. The method thereby further improves the accuracy of similarity calculation between mixed long and short texts.

(3) A news search system with semantic matching is designed and developed. Built on the interactive pre-training model for similarity calculation in mixed long and short texts, it improves the semantic relevance between search targets and search results. After a detailed analysis of the system requirements, the outline design, detailed design and database design are carried out. The resulting system can be used to analyse news on the Internet intelligently. It consists of a user module, a data management module, a search module and a data analysis module. It passes the functional and performance tests and runs stably.

The first similarity model designed in this thesis compensates for the shortcomings of the representation structure, and the second combines the advantages of the interaction structure and the pre-training structure. Both achieve high accuracy in calculating the similarity of mixed long and short texts. The news search system with semantic matching also provides relevant data analysis functions and has some reference value in the field of public opinion analysis.
Keywords/Search Tags:Text Representation, Text Similarity, Pseudo-siamese Network, Interactive Features, Pre-training Model
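
As a rough illustration of the pseudo-siamese idea summarised in the abstract (two different encoders, one per text length, whose outputs are compared in a shared space), the following is a minimal toy sketch. It is not the thesis model: the window-based bag-of-words encoder is only a stand-in for Longformer, the plain word-count encoder is only a stand-in for the BiLSTM + ABCNN branch, and all function names and the `window` parameter are hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def encode_short(text):
    """Short-text branch (toy stand-in for BiLSTM + ABCNN): raw word counts."""
    return dict(Counter(tokenize(text)))


def encode_long(text, window=8):
    """Long-text branch (toy stand-in for Longformer).

    The text is cut into fixed-size word windows, and each word is weighted
    by the fraction of windows in which it appears.
    """
    words = tokenize(text)
    windows = [words[i:i + window] for i in range(0, len(words), window)]
    presence = Counter()
    for chunk in windows:
        presence.update(set(chunk))  # count each word once per window
    n = max(len(windows), 1)
    return {word: count / n for word, count in presence.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(long_text, short_text):
    """Pseudo-siamese comparison: a different encoder for each input."""
    return cosine(encode_long(long_text), encode_short(short_text))


if __name__ == "__main__":
    article = (
        "The central bank raised interest rates again this week. "
        "Analysts expect further increases in interest rates if inflation "
        "stays high, and several bank executives warned about lending costs."
    )
    print(similarity(article, "bank interest rates"))
    print(similarity(article, "football match tonight"))
```

In a trained pseudo-siamese network the two branches are learned jointly so that their outputs become comparable; here both toy encoders simply share the same word vocabulary so that the cosine score is meaningful.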