The Improvement Of Academic Retrieval System Based On Citation Analysis

Posted on: 2013-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: S H Wu
GTID: 2248330371488142
Subject: Information Science
Abstract/Summary:
Academic information retrieval systems have become an essential research tool. However, the retrieval modules of many well-known academic databases, such as Elsevier, Web of Science, and CNKI, rely mainly on document content. Because academic papers often have very similar content, they are hard to tell apart, so content-based retrieval makes it difficult for users to find the information they need and hurts the user experience.

Unlike other texts, academic literature carries much information beyond its own content, including references, citations, author, institution, source journal, and funding. This external information largely reflects the quality and content of a document, and users often rely on it to judge relevance during search. This study applies citation relationships to search result clustering and relevance feedback, and designs an academic information retrieval architecture that uses content, references, citations, author, institution, source journal, funding, and so on.

The conclusions of this research are:

(1) I investigated the correlation between co-citation, bibliographic coupling, and document content using statistical correlation analysis. I downloaded papers from the BioMed database and obtained their text similarity (title-abstract similarity and full-text similarity) together with the co-citation and coupling matrices. The correlation analysis showed that co-citation and coupling counts are significantly correlated with text similarity.

(2) The citation context is used to extend the "bag of words" model. A citation context is the text surrounding the reference markers used to refer to other scientific works; it can provide additional terms to represent a paper. Experimental results show that this text representation method is effective.

(3) A novel clustering algorithm based on co-citation analysis is proposed. It has two steps. The first step performs co-citation analysis on the document set, builds the co-citation matrix, and runs hierarchical clustering on that matrix. At each iteration, the distance between documents within a cluster and the change in that distance from the previous iteration are recorded. At the end of the first step, the value of K and the cluster centers are taken from the iteration at which this change reaches its maximum. The second step runs the K-means algorithm on the content of the documents, using the selected K and centers. Experimental results show that clustering quality is improved.

(4) A novel n-gram-based cluster label extraction algorithm for English papers is proposed. Before clustering, the algorithm uses n-grams to build a list of domain phrases by learning from a large-scale corpus. The English papers are then clustered with the K-means algorithm. Finally, the highest-scoring n-gram terms in each cluster are extracted as its label; when the score is computed, a term that appears in the domain phrase list receives double weight. Experimental results show that the quality of cluster labels is improved. In addition, an improved TF-IDF calculation method is developed, and a new R@N method for evaluating cluster labels is proposed.

(5) A novel relevance feedback algorithm based on co-citation and bibliographic coupling is proposed. In the relevance judgment stage, co-citation and bibliographic coupling relations in the citation network are used to expand the set of relevant documents. The algorithm then uses clustering to extract terms from the relevant documents for query expansion.

(6) A new academic information retrieval architecture is designed that uses the algorithms proposed above together with comprehensive information including content, references, citations, author, institution, source journal, and funding.

Experiments comparing the new algorithms with existing ones show that the new algorithms achieve good results. I believe this research can help improve academic information retrieval systems.
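The co-citation and bibliographic coupling counts that the abstract relies on can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy citation data, the function names, and the use of NumPy are all assumptions made for illustration. Two papers are bibliographically coupled when they cite the same reference, and co-cited when a third paper cites both.

```python
import numpy as np

def citation_matrix(citations, n_docs):
    """Build a binary matrix a where a[i, j] = 1 if document i cites document j."""
    a = np.zeros((n_docs, n_docs), dtype=int)
    for citing, cited_list in citations.items():
        for cited in cited_list:
            a[citing, cited] = 1
    return a

def bibliographic_coupling(a):
    """Entry (i, j) counts the references shared by documents i and j."""
    b = a @ a.T
    np.fill_diagonal(b, 0)  # a document's coupling with itself is not of interest
    return b

def co_citation(a):
    """Entry (i, j) counts the documents that cite both i and j."""
    c = a.T @ a
    np.fill_diagonal(c, 0)
    return c

# Hypothetical toy data: papers 0 and 1 both cite papers 2 and 3,
# and paper 4 cites papers 0 and 1.
toy_citations = {0: [2, 3], 1: [2, 3], 4: [0, 1]}
A = citation_matrix(toy_citations, n_docs=5)
coupling = bibliographic_coupling(A)
cocitation = co_citation(A)
```

In this toy data, papers 0 and 1 share two references, papers 2 and 3 are cited together by two papers, and papers 0 and 1 are cited together by one. Matrices of this kind could serve as the input to hierarchical clustering or to the expansion of a relevant document set, although the abstract does not specify the exact distance measure or thresholds used.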
Keywords/Search Tags: co-citation, bibliographic coupling, citation context, n-gram, search result clustering, relevance feedback, cluster label, K-means algorithm