
Text Similarity Analysis Technology Based On Deep Learning And Its Application In Auxiliary Decision-making Of HIA

Posted on: 2022-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: A Wen
Full Text: PDF
GTID: 2518306743986999
Subject: Software engineering
Abstract/Summary:
Text similarity analysis is a core task in natural language processing, widely used in medical question answering systems, text information retrieval, and related fields. This thesis studies text similarity analysis based on deep learning and applies it to keyword recognition and literature retrieval in health impact assessment. In addition, because traditional Chinese text similarity models perform poorly when matching texts that differ greatly in length, this thesis proposes a similarity model for matching short texts against long texts, built on the "Health Impact Assessment" data set. Finally, based on this model, a "Health Impact Assessment Assistance System" providing entity keyword recognition and literature retrieval is implemented to help specialists complete their assessments efficiently and effectively. The thesis is divided into the following three parts:

(1) The literature abstract corpus of the "Health Impact Assessment" data set, collected with the web crawler Scrapy, contains a large number of similar documents. A hybrid model, Jaro-LSA, is built by combining the string-based Jaro-Winkler similarity model with the corpus-based Latent Semantic Analysis (LSA) similarity model, and is used to eliminate semantically similar entries from the literature abstract corpus. The texts to be evaluated in policy documents are then matched one by one, through manual labelling, with the abstract texts in the corpus to construct a relatively high-quality "Health Impact Assessment" data set.

(2) To perform text similarity analysis efficiently and effectively, two aspects are considered. First, to address the matching of short and long texts in the "Health Impact Assessment" data set, this thesis treats the context information of the two paragraphs of different lengths as key information, and merges each original text with its key information through word embedding to form a semantically expanded text representation that serves as the model input. Second, given the relatively small scale of the data set, this thesis adopts the Gated Recurrent Unit (GRU) as the basic unit and integrates an attention mechanism to improve the performance of the final model. In addition, to capture the representation information of the texts, the two embedded vector spaces are processed by two BiGRU-Attention models with the same parameters. Experiments show that the proposed model outperforms the other models compared on text similarity matching.

(3) On the basis of the KBS-GRU model with attention mechanism, this thesis designs and builds the "Health Impact Assessment" auxiliary system, which provides health impact literature retrieval and automatic keyword recognition. The system has been online for the Hangzhou Municipal Health Commission since July 2021.
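The deduplication step in part (1) can be illustrated with a minimal Python sketch. It is not the thesis's Jaro-LSA implementation: the abstract does not say how the two scores are combined, so the weighted average, the `weight` and `threshold` values, and all function names below are assumptions. To stay within the standard library, the corpus-based LSA component is replaced by a plain bag-of-words cosine similarity, and the text is assumed to be already segmented into whitespace-separated tokens.

```python
import math
from collections import Counter


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1] between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def cosine_bow(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (stand-in for the LSA component)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def hybrid_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Weighted average of the two scores; the weighting is an assumption."""
    return weight * jaro_winkler(a, b) + (1 - weight) * cosine_bow(a, b)


def deduplicate(abstracts: list[str], threshold: float = 0.85) -> list[str]:
    """Keep an abstract only if it is not too similar to any kept one."""
    kept: list[str] = []
    for text in abstracts:
        if all(hybrid_similarity(text, other) < threshold for other in kept):
            kept.append(text)
    return kept
```

Calling `deduplicate(corpus)` on a list of abstract strings returns the first occurrence of each group of near-duplicates. The pairwise comparison is quadratic in the corpus size, which is acceptable for a sketch but would likely need blocking or indexing on a large crawl.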
Keywords/Search Tags: Natural language processing, Crawler technology, Health impact assessment, Text similarity analysis, Literature retrieval