
Research on Concept and Short Text Semantic Relatedness Calculation Method

Posted on: 2021-05-14    Degree: Master    Type: Thesis
Country: China    Candidate: Q S Guo    Full Text: PDF
GTID: 2518306272481284    Subject: Software engineering
Abstract/Summary:
With the arrival of the Internet era, information technology has been integrated into all walks of life and has generated a large amount of data, of which text data is the most common. Researchers currently analyze text data mainly with natural language processing (NLP) techniques. NLP, an important direction in artificial intelligence, studies how computers can communicate with people effectively through natural language. As a fundamental research topic in NLP, the measurement of conceptual and textual semantic relatedness is widely used in word sense disambiguation, information extraction, automatic summarization, question answering systems, text classification, and other fields. Although semantic relatedness measurement has been studied extensively, people's needs and NLP technology keep evolving, and the accuracy and coverage of semantic relatedness calculation can still be improved. This paper therefore studies semantic relatedness measurement along two dimensions: concept and text. The main work of this paper is as follows:

(1) Existing Wikipedia-based methods for conceptual semantic relatedness suffer from tedious preprocessing, complex calculation, and low accuracy. To address these problems, the Wikipedia Two-way Link-vector Model is proposed. Links in Wikipedia are manually defined by volunteers and vetted through review, so they are close to human semantics; to better simulate human semantics, this paper explains each concept by combining its external and internal links in Wikipedia into a two-way link vector. For Wikipedia's particular structure, a disambiguation strategy based on Wikipedia disambiguation pages is proposed, which exploits people's shared cognition to select the correct sense term for calculation and removes interference that degrades the accuracy of conceptual semantic relatedness. Experiments show that the degree of link overlap between Wikipedia concepts is low, which causes some calculated values to deviate nonlinearly from human judgments. Therefore, through nonlinear processing with logarithmic and exponential functions, the cosine similarity and Jaccard similarity coefficient formulas are improved, yielding two vector similarity formulas suited to Wikipedia bidirectional link vectors (see the first sketch below). The improved formulas measure the distance between concept-interpretation vectors to quantify semantic relatedness. The model is evaluated on multiple data sets, including MC30, RG65, WS353, and MEN-3000, and performs well compared with other algorithms. In the interpretation task, the F1 value of this method reaches 0.81.

(2) To improve the accuracy of semantic calculation for words in different language environments, this paper uses the word sense inventories and synonym sets already organized in WordNet as prior knowledge and proposes a generation method for sense vectors based on Wikipedia word statistics, which alleviates the inability of a single word vector to resolve polysemy; a corresponding disambiguation method is also proposed (see the second sketch below). On the SCWS-2003 data set, the Spearman coefficient of the sense vectors in this paper is 15% higher than that of the original word-statistics vectors, indicating that the method has a certain effectiveness.
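The abstract does not give the improved formulas from point (1), so the following Python sketch is only one plausible reading: a concept is represented by the set of its inbound and outbound links, and an assumed logarithmic adjustment lifts the low raw overlap between link sets. The function names and the specific log form are illustrative assumptions, not the thesis's formulas; the cosine variant would be adjusted analogously.

    import math

    def two_way_link_vector(in_links, out_links):
        # Represent a Wikipedia concept by the union of its inbound
        # ("what links here") and outbound (in-article) links.
        return set(in_links) | set(out_links)

    def log_adjusted_jaccard(a, b):
        # Plain Jaccard is |intersection| / |union|. Link overlap
        # between Wikipedia concepts is typically low, so raw values
        # cluster near 0 and deviate nonlinearly from human judgments.
        union = a | b
        if not union:
            return 0.0
        raw = len(a & b) / len(union)
        # Assumed log adjustment: a concave map that lifts small raw
        # similarities while fixing the endpoints 0 -> 0 and 1 -> 1.
        return math.log1p(raw * (math.e - 1))

    # Usage: relatedness of two concepts from their (toy) link sets.
    car = two_way_link_vector({"Vehicle", "Engine"}, {"Wheel", "Road"})
    bus = two_way_link_vector({"Vehicle", "Passenger"}, {"Wheel", "Road"})
    print(log_adjusted_jaccard(car, bus))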
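As an illustration of the disambiguation described in point (2), the sketch below picks, for a polysemous word, the WordNet sense whose Wikipedia-statistics vector is most similar to the current context vector. How the sense vectors are built from Wikipedia counts is not specified in the abstract, so the precomputed vectors and sense keys here are assumptions for illustration only.

    import numpy as np

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    def pick_sense(context_vec, sense_vecs):
        # sense_vecs maps a WordNet sense key (e.g. "bank.n.01") to a
        # vector built from Wikipedia word statistics for that sense;
        # both the keys and the vectors are assumed to be precomputed.
        # The sense whose vector best matches the context wins.
        return max(sense_vecs, key=lambda k: cosine(context_vec, sense_vecs[k]))

    # Usage: disambiguate "bank" against a toy context vector.
    senses = {"bank.n.01": np.array([0.9, 0.1]),   # sloping river bank
              "bank.n.02": np.array([0.1, 0.9])}   # financial institution
    print(pick_sense(np.array([0.2, 0.8]), senses))  # -> "bank.n.02"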
(3) For short-text relatedness, this paper focuses on the semantic relatedness of Chinese texts and proposes a calculation method based on a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM). The CNN extracts local features from text, but global information is lost after the pooling layer; the BiLSTM remembers long-term information and can handle long-range dependencies in text. To extract sentence features at different granularities, this paper combines a CNN without a pooling layer and a BiLSTM in a Siamese network framework (see the sketch below). The method is tested on the Chinese STS, Chinese LCQMC, and CCKS2018 data sets; on CCKS2018, both the accuracy and the F1 value of the method reach 0.9.
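A minimal PyTorch sketch of the Siamese framework described in point (3): a shared encoder applies a pooling-free convolution to extract local features, then a BiLSTM to capture long-range context, and relatedness is the cosine similarity of the two sentence vectors. All layer sizes and the choice of cosine as the final score are assumptions; the thesis's exact architecture and hyperparameters are not given in this abstract.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CnnBiLstmEncoder(nn.Module):
        # One tower of the Siamese network: a CNN without pooling
        # extracts local n-gram features, then a BiLSTM models
        # long-range context. Sizes are illustrative assumptions.
        def __init__(self, vocab_size, emb_dim=128, conv_channels=128,
                     kernel_size=3, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            # Convolution only; no pooling layer, padding keeps length.
            self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size,
                                  padding=kernel_size // 2)
            self.bilstm = nn.LSTM(conv_channels, hidden, batch_first=True,
                                  bidirectional=True)

        def forward(self, token_ids):                    # (B, T)
            x = self.emb(token_ids).transpose(1, 2)      # (B, E, T)
            x = F.relu(self.conv(x)).transpose(1, 2)     # (B, T, C)
            _, (h, _) = self.bilstm(x)                   # h: (2, B, H)
            return torch.cat([h[0], h[1]], dim=-1)       # (B, 2H)

    class SiameseRelatedness(nn.Module):
        # The same encoder weights are applied to both sentences;
        # relatedness is the cosine similarity of the two vectors.
        def __init__(self, vocab_size):
            super().__init__()
            self.encoder = CnnBiLstmEncoder(vocab_size)

        def forward(self, s1, s2):
            return F.cosine_similarity(self.encoder(s1), self.encoder(s2))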
Keywords/Search Tags: Wikipedia, Semantic Relatedness, Link Vector, Neural Networks, Concept Disambiguation