
Research On Text Retrieval Based On Topic Analysis

Posted on: 2016-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: X L Luo
GTID: 2308330464472621
Subject: Computer application technology
Abstract/Summary:
With the rapid development of search engine technology, we can access the information we need through the Internet anytime and anywhere. However, web information is growing exponentially, and users demand more from search results than ever. Making search engines more intelligent and personalized has become an urgent problem, and identifying the desired information efficiently and accurately in such a vast ocean of information remains a great challenge.

When describing the relationship between user queries and candidate documents, traditional information retrieval systems consider only literal word matching and make little use of the semantic correlations between words, which leaves a large gap between the search results and user requirements.

In this thesis, we first use a topic model to extract the main topics of the candidate documents and then use these topics to calculate the relevance between the query words and the candidate documents. Finally, we rank the documents by relevance and present the results to the end user. In this process, the topic model shows several deficiencies. First, the number of topics, k, is chosen more or less arbitrarily, which may result in high overlap and low discrimination between topics. Second, a topic distribution estimated over the whole corpus cannot fully represent the topic distribution of a single document; this may make the probability distributions within topics highly sparse and even reduce the precision of the topic feature representation of a single document. To address these shortcomings, in Chapters III and IV we improve the model so that it can play a more important role in information retrieval.

In Chapter III, we present a text retrieval method based on Word Representation Latent Dirichlet Allocation.
This topic modeling method takes both the initialization of topic sampling and the overlap between topics into account. We describe the relationships between topics using word representations and use them to determine the value of k so that the topics remain relatively independent. We then model the corpus and obtain two multinomial distribution matrices: document-topic and topic-word. From these two matrices we derive the representation relationship between a word and a document, which we call "semantic contribution" in this thesis. Based on the semantic contribution of the words, we describe how closely the query terms are tied to the set of candidate documents with the "value". We then rank the candidate documents and display the ranked results on the user query interface.

In Chapter IV, we present a text retrieval method based on Cluster Word Representation Latent Dirichlet Allocation. This method is a further improvement on the work in Chapter III. In information retrieval, the traditional topic model performs below expectations in the topic modeling module. Our analysis shows that modeling topics over the whole corpus loses some precision in expressing the topic characteristics of a single document and thus affects that document's topic distribution. We therefore cluster the documents before modeling, grouping documents with the same or similar topics as far as possible, and apply the topic model to each cluster, which brings the modeling ability of the topic model into full play. When computing the semantic contribution of words, we use interactive encyclopedia knowledge to improve the calculation of semantic contribution between words, which makes the semantic relationships between words more accurate.

In our experiments, we use NTCIR-5 (NACSIS Test Collections for IR) as the corpus and use the TREC information retrieval evaluation tools to compute the relevant metrics.
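
As a rough illustration of how the document-topic and topic-word matrices described above can be combined into a word-to-document score, the sketch below uses the common LDA document-model form P(w|d) = sum over k of P(w|k) * P(k|d) and ranks documents by the summed log score of the query terms. This is an assumption: the abstract does not give the thesis's exact definition of semantic contribution, which may differ. The toy matrices, the vocabulary, and the function names `word_doc_score` and `rank_documents` are hypothetical.

```python
import math

# Toy inputs standing in for the output of a trained topic model (assumed values).
# doc_topic[d][k] = P(topic k | document d); each row sums to 1.
doc_topic = [
    [0.9, 0.1],  # document 0: mostly topic 0
    [0.2, 0.8],  # document 1: mostly topic 1
    [0.5, 0.5],  # document 2: mixed
]

# topic_word[k][w] = P(word w | topic k); each row sums to 1.
vocab = {"search": 0, "engine": 1, "protein": 2, "gene": 3}
topic_word = [
    [0.5, 0.4, 0.05, 0.05],  # topic 0: web search
    [0.05, 0.05, 0.5, 0.4],  # topic 1: biology
]


def word_doc_score(word_id, doc_id):
    """P(w | d) = sum over topics k of P(w | k) * P(k | d)."""
    return sum(
        topic_word[k][word_id] * doc_topic[doc_id][k]
        for k in range(len(topic_word))
    )


def rank_documents(query_terms):
    """Rank documents by the summed log score of in-vocabulary query terms."""
    ids = [vocab[t] for t in query_terms if t in vocab]
    scored = []
    for d in range(len(doc_topic)):
        log_score = sum(math.log(word_doc_score(w, d)) for w in ids)
        scored.append((d, log_score))
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_documents(["search", "engine"])
print([doc_id for doc_id, _ in ranking])
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow on longer queries. Out-of-vocabulary query terms are simply skipped here, which is a simplification.
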
The experimental results show that the document retrieval systems based on the word representation topic model and the cluster topic model perform well on MAP, R-precision, and P@N and improve the accuracy and recall of the retrieval system. This indirectly indicates the feasibility of the methods proposed in this thesis.
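
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how P@N, R-precision, and MAP are usually defined. It is not the TREC evaluation tooling used in the thesis and does not reproduce its input formats or edge-case handling. The document identifiers and relevance judgments are made-up toy data.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked documents that are relevant."""
    top = ranked[:n]
    return sum(1 for d in top if d in relevant) / n


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_n(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: mean of per-query average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries: (system ranking, set of relevant documents).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6", "d7", "d8"], {"d6"}),
]

print(precision_at_n(runs[0][0], runs[0][1], 2))
print(r_precision(runs[0][0], runs[0][1]))
print(mean_average_precision(runs))
```

Relevant documents that never appear in the ranking still count in the denominator of average precision, so missing them lowers the score.
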
Keywords/Search Tags: Search Engine, Information Retrieval, Word Representation Latent Dirichlet Allocation, Cluster Word Representation Latent Dirichlet Allocation, Semantic Contribution, Topic Modeling