Research And Implementation Of Tax Vertical Search Engine And Improved PageRank Algorithm

Posted on: 2020-11-10   Degree: Master   Type: Thesis
Country: China   Candidate: Y Xu   Full Text: PDF
GTID: 2428330626950000   Subject: Computer software and theory
Abstract/Summary:
The business background of the tax field is complex. The deeper levels of the tax system are hard to remember and search, and the diversity of policy information makes retrieval inconvenient. Existing general-purpose search engines return many results with low accuracy and little authority, and they suffer from problems such as paid ranking and web page cheating. PageRank, a web page ranking algorithm, suffers from topic drift, a bias toward old pages, and a lack of authority; because it scores only links, it cannot evaluate page content. To address these problems, this thesis proposes a tax vertical search engine based on an improved PageRank algorithm. The goal is to provide topical and authoritative tax search, to answer queries accurately by correctly understanding users' search intent, and to offer fast, convenient access to tax modules and information retrieval functions. The main work of this thesis includes:
(1) Research on and improvement of the web page ranking algorithm. After studying the principles of the PageRank and HITS algorithms and comparing their advantages and disadvantages, PageRank is chosen as the basis and improved in three respects: an authority factor is introduced to address the lack of website authority; a time evaluation factor is integrated because new web pages otherwise cannot be scored; and a method for calculating the content relevance of web pages, based on the vector space model, is integrated to address topic drift. Experiments show that the improved algorithm effectively mitigates topic drift and improves the freshness and authority of the ranked pages.
(2) Research on and implementation of Chinese word segmentation. Considering programming language, integration difficulty, feature richness, and lexicon extensibility, the ICTCLAS Chinese word segmenter is selected to support the analysis of web page content and search text. The tax question-and-answer corpus is segmented with this word segmenter to construct a tax subject thesaurus.
(3) Information capture module. An initial crawl database is preset, the similarity between the web pages to be crawled and the thesaurus is calculated with the vector space model, and the improved algorithm is used to score the crawled web pages. Topic constraints based on hyperlinks and the thesaurus are applied when crawling, and the quality of the search content is controlled through the choice of information sources.
(4) Implementation of the vertical search engine based on Nutch and Solr. The improved algorithm is introduced into the indexing of crawled content, where it influences the boost scoring mechanism.
The experimental results show that the improved algorithm works well in the tax vertical search engine. As the number of web pages increases, the proportion of topic-relevant pages under the improved algorithm declines more slowly than under the original algorithm and remains higher throughout. The accuracy of the improved algorithm is about 15% higher than that of the original algorithm, and the accuracy of the first 15 results reaches 72%. The authority and freshness of the improved search results are also significantly better.
Keywords/Search Tags: Vertical Search Engine, Subject Crawler, PageRank Algorithm, Tax Field
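
The abstract names three adjustments to PageRank (an authority factor, a time evaluation factor, and a vector-space content relevance measure) but does not give the formulas. The following is only a minimal sketch of how such factors might be combined, not the thesis's actual method. The multiplicative combination, the weights, the half-life freshness decay, and every function and parameter name (`adjusted_scores`, `authority`, `age_days`, `half_life_days`, and so on) are assumptions made for illustration.

```python
import math
from collections import Counter


def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration; `links` maps page -> list of outlinked pages."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity of two term lists under a raw term-frequency vector space model."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def adjusted_scores(links, authority, age_days, page_terms, topic_terms,
                    weights=(1.0, 1.0, 1.0), half_life_days=365.0):
    """Hypothetical combination: link rank scaled by authority, freshness, and relevance."""
    w_auth, w_time, w_topic = weights
    scores = {}
    for page, rank in pagerank(links).items():
        # Assumed freshness model: halves every `half_life_days`; unknown age counts as one half-life.
        freshness = 0.5 ** (age_days.get(page, half_life_days) / half_life_days)
        relevance = cosine_similarity(page_terms.get(page, []), topic_terms)
        boost = 1.0 + w_auth * authority.get(page, 0.0) + w_time * freshness + w_topic * relevance
        scores[page] = rank * boost
    return scores


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    links = {"gov": ["blog"], "blog": ["gov"], "spam": ["gov"]}
    scores = adjusted_scores(
        links,
        authority={"gov": 1.0},
        age_days={"gov": 30, "blog": 900, "spam": 10},
        page_terms={"gov": ["tax", "vat", "rate"], "blog": ["tax", "travel"], "spam": ["casino"]},
        topic_terms=["tax", "vat", "invoice"],
    )
    for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(page, round(score, 4))
```

In this sketch the link-based rank stays the dominant signal and the three factors only scale it, so a page with no inbound links cannot rise on content alone. An additive combination would behave differently, and the abstract does not say which form the thesis uses.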