Research On Text Similarity Calculation Method And Its Application In Financial Field

Posted on: 2022-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: H T Wang
GTID: 2518306524493904
Subject: Master of Engineering
Abstract/Summary:
With the rapid development of the Internet, more and more information is distributed across the network. With the arrival of 5G in particular, the amount of information on the network will grow further, and users are eager to find the content they are interested in within this vast amount of information. How to retrieve such massive information, how to obtain related information of the same kind, and how to avoid pushing repeated information all depend on text similarity calculation and its related algorithms. In recent years, many deep learning models have been proposed to measure the similarity between texts, but the input of such models is usually short text, whereas everyday articles are usually long texts, so a deep learning model cannot be used to calculate the similarity between articles directly. Therefore, this dissertation uses abstract extraction algorithms to compress the long text, then uses a deep learning model to calculate the similarity between the articles, with keyword-based topic information as additional features. The research work of this dissertation mainly consists of the following parts:

1. This dissertation proposes a keyword extraction algorithm (WS-Rank*) that considers the importance of sentences. The algorithm is an unsupervised graph-based method. Because the importance of sentences directly affects the importance of words, and thus the accuracy of keyword extraction, WS-Rank* takes both words and sentences as nodes in the graph and updates the importance of words and sentences simultaneously in each iteration. Compared with common graph-based algorithms (PageRank, TextRank, LTWPR, and WS-Rank) on real news datasets, F1 values improve by up to 3%. The key sentences extracted by the algorithm can be used as the input of the abstract generation model, and the keywords can be used to constrain the extraction of topic information.

2. This dissertation proposes two sequence-to-sequence models for abstract generation. The first is a CNN-based abstract generation model with a pointer mechanism, which solves the OOV (out-of-vocabulary) problem through a pointer network. Experiments on the Chinese abstract dataset (NLPCC 2017) show that the pointer mechanism improves the accuracy of the model by 5% (ROUGE-1), 6% (ROUGE-2), and 4% (ROUGE-L). The second is a pre-trained model based on MASS, which uses the Transformer as its basic unit, is pre-trained on a corpus of 200,000 texts, and is fine-tuned on the NLPCC 2017 dataset. The accuracy is improved by 6%, 7%, and 5%, and the generated sentences are much more fluent. The output of the abstract generation model can be used as the input of the short text similarity calculation model.

3. This dissertation proposes the Si-t Bert model to calculate the similarity between short texts. After obtaining the semantic information of each single sentence with BERT, the model uses an interaction layer to capture the interaction information between sentences and adds the topic information of the sentences. The vectors are then mapped into several subspaces to calculate similarity. Finally, experiments are carried out on datasets of different sizes and tasks. Compared with the original Siamese network, the interaction layer and similarity layer improve the performance of the model (by up to 9%). Incorporating topic information can influence the performance of the model, but the effect depends on the performance of the topic model itself.
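The idea in part 1, scoring words and sentences together so that each reinforces the other, can be illustrated with a minimal sketch. This is not the WS-Rank* algorithm itself, whose update rules are not given in the abstract; it is a generic HITS-style mutual reinforcement loop. The function name `co_rank`, the regular-expression tokenizer, and the fixed iteration count are all assumptions made for illustration. The thesis works on Chinese news text, which would need a word segmenter and stop-word filtering instead of the simple tokenizer used here.

```python
import re
from collections import defaultdict


def co_rank(sentences, iterations=30):
    """Jointly score words and sentences by mutual reinforcement.

    Hypothetical illustration only: a sentence gains importance from the
    words it contains, and a word gains importance from the sentences it
    appears in. Both score sets are renormalized to sum to 1 each round.
    """
    # Assumed tokenizer: lowercase alphanumeric runs (not suitable for Chinese).
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    word_score = {w: 1.0 for w in vocab}
    sent_score = [1.0] * len(sentences)

    for _ in range(iterations):
        # A sentence is important if it contains important words.
        new_sent = [sum(word_score[w] for w in set(toks)) for toks in tokenized]
        norm = sum(new_sent) or 1.0
        sent_score = [s / norm for s in new_sent]

        # A word is important if it appears in important sentences.
        new_word = defaultdict(float)
        for toks, s in zip(tokenized, sent_score):
            for w in set(toks):
                new_word[w] += s
        norm = sum(new_word.values()) or 1.0
        word_score = {w: new_word[w] / norm for w in vocab}

    return word_score, sent_score


if __name__ == "__main__":
    docs = ["bank loan risk", "bank loan rate", "weather today"]
    words, sents = co_rank(docs)
    # Print the three highest-scoring words.
    print(sorted(words, key=words.get, reverse=True)[:3])
```

In this toy input, words shared by several related sentences rise to the top, while the unrelated sentence and its words lose weight over the iterations. A practical version would also weight edges (for example by co-occurrence or sentence similarity) and add a damping factor as in PageRank.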
Keywords/Search Tags: Similarity Calculation, Keywords Extraction, Automatic Abstracting, Deep Learning