Research on Information Retrieval Method Based on Semantic Similarity of Central Sentences

Posted on: 2022-01-06 | Degree: Master | Type: Thesis | Country: China | Candidate: X Huang | GTID: 2518306350451844 | Subject: Computer Science and Technology

Abstract/Summary:

With the explosive growth of information in the current era, helping users quickly find what they need in massive amounts of data has become very important. This need has raised many academic questions worth studying and has also given rise to technology giants such as Baidu and Google, whose main business is search engines. One of the core technologies of search engines is information retrieval. Text information retrieval studies the process of finding documents that meet a user's needs in a large-scale document collection. Its key steps are computing the match between a query and each document, scoring the degree of match, and ranking the documents accordingly. A good information retrieval model ranks the documents relevant to the query topic at the top.

Traditional information retrieval models can be regarded as exact matching or relevance matching of terms. These models are efficient but ignore the semantics of words, so they do not resolve the semantic matching problem between queries and documents. Neural ranking models use deep neural networks to learn semantic representations of the query and the document, and then compute the semantic match between them through interaction-based encoding or cosine similarity, which addresses the semantic mismatch problem. Recently, scholars have drawn on the advantages of both traditional models and neural ranking models to propose information retrieval methods that combine relevance matching and semantic matching. However, these methods use all sentences of a document to compute its semantic similarity to the query, which is time-consuming and expensive. In addition, existing hybrid information retrieval models require multiple hyperparameters to be tuned, which is neither concise nor efficient.

In response to these problems, this thesis proposes an information retrieval method and a framework that incorporates it: the central sentence semantic similarity model CSSS and the hybrid retrieval framework XCSSS.

First, this thesis designs the central sentence semantic similarity information retrieval model CSSS. The model uses a sliding window mechanism to extract the central sentences of candidate documents. To save computation time, only the central sentences, rather than all sentences of the document, are used to compute the semantic similarity between the document and the query topic. Extensive experiments on four Text REtrieval Conference (TREC) data sets verify the effectiveness of the CSSS model. The experiments show that the MAP and P@20 values of the CSSS model improve significantly over traditional information retrieval models. Using the same pre-trained model and parameter settings, this thesis further compares CSSS with a retrieval model that computes semantic similarity over all sentences. The experimental results show that using the central sentences of the document in semantic matching is more accurate than using all of its sentences.

Second, this thesis proposes an improved hybrid information retrieval framework, XCSSS, which combines relevance matching and semantic matching non-linearly. Compared with other hybrid models that use a linear combination, the hybrid models formed by this framework have far fewer hyperparameters and require less tuning time. Applying this framework to BM25, a representative of the classic probabilistic retrieval models, and DLM, a representative of the language models, yields two improved hybrid information retrieval models: BM25CSSS and DLMCSSS. This thesis conducts a series of experiments on four TREC data sets to verify the effectiveness of BM25CSSS and DLMCSSS. The experiments show that the MAP and P@20 values of both hybrid models improve significantly over their respective baseline models. Their retrieval performance is also significantly better than that of some deep neural ranking models. Compared with other hybrid information retrieval models of the same type that use the same pre-trained language model, the hybrid models proposed in this thesis have lower time complexity while maintaining retrieval performance, which effectively reduces the semantic computation overhead on the pre-trained language model. The experiments further compare the hybrid models with the central sentence semantic similarity model, and the results show that combining semantic matching and relevance matching is more effective. Finally, the thesis analyzes the impact of the number of central sentences on the performance of the two improved hybrid information retrieval models.

Keywords/Search Tags: Information Retrieval, Neural Ranking Model, Probability Model, Language Model, Semantic Matching
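
The abstract does not give the exact window definition, the pre-trained language model, or the combination formula, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes that a window of consecutive sentences is scored by the number of query-term occurrences it contains, that a bag-of-words cosine similarity stands in for the similarity of pre-trained language model embeddings, and that the non-linear combination is a simple product. The names `central_sentences`, `cosine_similarity`, and `hybrid_score`, the `window` parameter, and the relevance score of 12.5 are all hypothetical and are not taken from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(document):
    """Naive splitter on '.', '!' and '?' (an assumption for illustration)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def central_sentences(document, query, window=2):
    """Slide a window of `window` consecutive sentences over the document and
    return the sentences of the window with the most query-term occurrences.
    Ties go to the earliest window. The scoring rule is assumed."""
    sentences = split_sentences(document)
    if len(sentences) <= window:
        return sentences
    query_terms = set(tokenize(query))

    def overlap(start):
        tokens = tokenize(" ".join(sentences[start:start + window]))
        return sum(1 for t in tokens if t in query_terms)

    best = max(range(len(sentences) - window + 1), key=overlap)
    return sentences[best:best + window]


def cosine_similarity(text_a, text_b):
    """Cosine similarity of term-count vectors; a placeholder for the
    similarity of embeddings from a pre-trained language model."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(relevance_score, semantic_score):
    """Hypothetical non-linear (multiplicative) combination that needs no
    interpolation weight; the thesis's actual formula is not in the abstract."""
    return relevance_score * (1.0 + semantic_score)


document = (
    "The museum opens at nine. "
    "Neural ranking models learn semantic representations of queries. "
    "They match queries and documents with cosine similarity. "
    "Tickets are sold at the entrance."
)
query = "semantic matching of queries and documents"

central = central_sentences(document, query, window=2)
semantic = cosine_similarity(query, " ".join(central))
# 12.5 is a made-up relevance score, e.g. what a BM25 ranker might return.
print(central)
print(hybrid_score(12.5, semantic))
```

Only the selected central sentences reach the similarity function, which is where the time saving described in the abstract would come from when the similarity is computed with a costly pre-trained model. The multiplicative form is one way to avoid the interpolation weight that a linear combination would need.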