
Research On The Full Text Retrieval In Scientific Literature Sharing Platform

Posted on: 2008-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Tan
Full Text: PDF
GTID: 2178360272468720
Subject: Computer system architecture
Abstract/Summary:
As the web has grown explosively and the resources available to us have multiplied, finding the information we actually need has come to demand a great deal of time and effort. Traditional literature retrieval systems treat each document as a bag of words (BOW) and rank the result list by the cosine similarity between the document vector and the query vector. This approach, however, ignores an article's context information, which is helpful for evaluating similarity. In SemreX, we adopt a new ranking algorithm that brings context information, such as the article's classification and reference validation, into the similarity calculation. We evaluate the output of our method against that of the traditional method using the TREC_EVAL program. The experimental results indicate that the new method clearly improves retrieval precision over the traditional approach.

Another important function of SemreX is finding literature similar to a document the user is interested in. This function is quite common in other literature retrieval systems, such as Citeseer and CNKI. Finding similar literature also reveals semantic relationships between documents. Because the search is very time-consuming, we use term compression and a candidate literature set to improve retrieval efficiency. We also use information theory (IT) to evaluate the similarity between documents. Because we use the candidate literature set, the system's computing time grows nonlinearly with the size of the literature repository. Currently, SemreX can find similar documents for twenty thousand papers per day.

Generally speaking, a traditional retrieval system returns a very large number of results, and the relationships among them are not visible to the user, who must then spend a lot of time browsing the result list. SemreX first analyzes the result list with an online classification algorithm: it groups the results into clusters and labels each cluster with a suitable string, then presents the clusters in a graphical user interface. Users can thus browse the results conveniently, which improves retrieval effectiveness...
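
The contrast between bag-of-words ranking and context-aware ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the SemreX algorithm, which the abstract does not specify. The functions `bow_vector`, `cosine_similarity`, and `rank`, the `category` field, and the additive `category_weight` bonus are all hypothetical, chosen only to show how one piece of context (an article's classification) might be folded into a cosine-based score.

```python
import math
from collections import Counter


def bow_vector(text):
    """Build a bag-of-words term-frequency vector from whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs, query_category=None, category_weight=0.0):
    """Return document ids ordered by descending score.

    With the default arguments this is the traditional BOW baseline.
    The category bonus is an assumption made for illustration only:
    a document whose classification matches the query's gets a fixed
    additive boost.
    """
    q = bow_vector(query)
    scored = []
    for doc in docs:
        score = cosine_similarity(q, bow_vector(doc["text"]))
        if query_category is not None and doc.get("category") == query_category:
            score += category_weight  # hypothetical context term
        scored.append((score, doc["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Toy collection, invented for this example.
docs = [
    {"id": "d1", "text": "full text retrieval ranking", "category": "IR"},
    {"id": "d2", "text": "retrieval of protein structures", "category": "Bio"},
    {"id": "d3", "text": "compiler optimization", "category": "PL"},
]

print(rank("text retrieval", docs))
# ['d1', 'd2', 'd3']
print(rank("text retrieval", docs, query_category="Bio", category_weight=0.5))
# ['d2', 'd1', 'd3']
```

In the baseline call, only term overlap matters, so `d1` ranks first. Once the assumed category bonus is applied, `d2` overtakes it. This mirrors, in a very simplified form, how context information can reorder a result list.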
Keywords/Search Tags: information retrieval, reference validation, similar literature measure, cluster online algorithm