
Stem Extraction And Related Ranking Optimization For Lightweight Retrieval Services

Posted on: 2023-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhu
GTID: 2568306836464594
Subject: Engineering
Abstract/Summary:
With the rise of a new generation of information technology and the rapid development of the Internet industry, the volume of data has grown dramatically. To let billions of users quickly obtain useful information from massive data, it is important to improve both the retrieval quality and the query efficiency of search engines, and doing so raises two challenges. On the one hand, user queries are becoming more complex, and the morphological variation of vocabulary diversifies the search terms, while existing stemming algorithms generally suffer from understemming and unsatisfactory stemming accuracy. On the other hand, retrieving documents that satisfy a user's query from massive data is very time-consuming, and existing methods that partition documents across multiple servers to reduce query latency often suffer from tail latency. To address these two issues, this thesis studies text preprocessing and related query ranking in depth.

First, for the text preprocessing stage, the word form normalization algorithm APS is designed, which effectively mitigates the understemming and unsatisfactory stemming accuracy of existing algorithms. The algorithm adjusts the definition of the rule function based on inflectional and derivational morphology, optimizes feature word extraction, adds handling for irregular verbs and several suffixes, and adds support for stop-word filtering. To evaluate APS, experiments are conducted on three real datasets, verifying that the algorithm reduces understemming and improves stemming accuracy.

Second, for the related query ranking stage, the anytime ranking algorithm SAR is designed based on the Score-At-A-Time query processing strategy. The algorithm can terminate query processing early, either after processing a specified number of inverted segments or once a given time budget is exhausted. This greatly reduces query evaluation latency while keeping the loss in retrieval quality within an acceptable range, and it effectively mitigates the tail latency problem prevalent in existing methods. Experiments are carried out on two real large-scale TREC standard datasets, ClueWeb09B and ClueWeb12-B13, and SAR is evaluated with the retrieval quality metric nDCG@10. Query latency and the reduction in the number of inverted segments processed under a given time budget are recorded, which verifies the effectiveness of SAR in controlling tail latency.

Finally, the lightweight general-purpose information retrieval framework ADJASSjr is designed and implemented on top of the stemming algorithm APS and the anytime ranking algorithm SAR, and it is evaluated against existing open-source search engines on the TREC dataset WSJ. The experimental results show that ADJASSjr reduces query latency by 25%-35% while maintaining better retrieval quality.
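The early-termination idea behind anytime Score-At-A-Time ranking can be illustrated with a minimal Python sketch. It is not the thesis's SAR implementation: the function name `saat_anytime`, the toy index layout (each term maps to a list of `(impact, doc_ids)` segments), and the parameters `max_segments` and `time_budget_s` are assumptions made for illustration only.

```python
import heapq
import time
from collections import defaultdict


def saat_anytime(index, query_terms, k=10, max_segments=None, time_budget_s=None):
    """Score-at-a-time ranking with optional early termination (illustrative)."""
    # Gather the impact-ordered segments of every query term.
    segments = []
    for term in query_terms:
        segments.extend(index.get(term, []))
    # Process the highest-impact segments first, regardless of term.
    segments.sort(key=lambda seg: seg[0], reverse=True)

    accumulators = defaultdict(int)  # doc_id -> partial score
    start = time.perf_counter()
    processed = 0
    for impact, doc_ids in segments:
        # Stop once the segment budget is used up.
        if max_segments is not None and processed >= max_segments:
            break
        # Stop once the time budget is used up.
        if time_budget_s is not None and time.perf_counter() - start >= time_budget_s:
            break
        for doc_id in doc_ids:
            accumulators[doc_id] += impact
        processed += 1

    # Top-k by score; ties are broken in favour of the smaller doc_id.
    top = heapq.nlargest(k, accumulators.items(), key=lambda kv: (kv[1], -kv[0]))
    return top, processed


# Hypothetical toy index: term -> list of (impact, doc_ids) segments.
toy_index = {
    "stem": [(3, [1, 4]), (1, [2, 7])],
    "rank": [(2, [1, 2]), (1, [4, 5])],
}

full_top, full_segments = saat_anytime(toy_index, ["stem", "rank"], k=3)
early_top, early_segments = saat_anytime(toy_index, ["stem", "rank"], k=3, max_segments=2)
```

On this toy index, processing only the two highest-impact segments already yields the same top-3 document order as the exhaustive run, although the scores are only partial. That is the trade-off anytime ranking relies on: low-impact segments are the ones skipped, so a budget costs little ranking quality.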
Keywords/Search Tags: stemming algorithm, anytime ranking algorithm, text preprocessing, Score-At-A-Time, related ranking