Research And Application Of Vertical Search Engine In The Tobacco Industry

Posted on: 2017-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: L F Chen
GTID: 2308330482980615
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet provides abundant information, but it also makes that information harder to filter. Internet users increasingly rely on search engines to narrow the scope of retrieval and reach relevant content more efficiently. Users expect a small, precise result set, whereas a general search engine pursues broad and varied coverage. The vertical search engine emerged in response to this mismatch. If the Internet is regarded as a service provider, an important sign of its maturity is the shift from generalized to customized service; the vertical search engine is one manifestation of this shift and indicates a future direction for search engines.

First, against the background of Internet development, this thesis summarizes the background and significance of vertical search engines. Taking the tobacco industry as a case, it analyzes the conflict between the inherent limitations of general search engines and enterprises' growing demand for information retrieval, and demonstrates the necessity and feasibility of applying a vertical search engine within the enterprise.

Drawing on my internship and project experience at a tobacco enterprise, the aim of this thesis is to design and implement a vertical search engine for the tobacco industry. After a thorough study of the overall architecture and key technologies of vertical search engines, we propose an effective subject-discrimination method named "three-times-filter" and improve the adaptability of the PageRank algorithm in practical applications.
We then localized Lucene's source code and ultimately developed a vertical search engine suited to the tobacco industry.

The main research content of this thesis is as follows:

(1) Building on a study of typical search engine architecture, three key technologies are introduced in detail: Chinese word segmentation, the inverted index, and hyperlink analysis. After a side-by-side comparison of retrieval models, the Boolean model is used for basic text filtering and the vector space model for advanced matching. This yields a search model suited to the characteristics of the tobacco industry that is both simple and able to support relevance scoring.

(2) The iterative process of the PageRank algorithm was simulated in Java, and the black hole problem in the hyperlink matrix and the data imbalance in the PageRank vector were analyzed in depth. The black hole problem lets some pages monopolize rank, which makes the ranking neither objective nor scientific, and the data imbalance makes the iteration converge too slowly for large-scale computation. Drawing on Markov chain theory, random adjustments were added to the original model in two ways. These bring the model closer to how web surfers actually browse, eliminate the black hole problem at its root, and accelerate the convergence of the PageRank vector.

(3) This thesis proposes a subject-discrimination method named "three-times-filter". Using a professional dictionary for the tobacco industry, an anticipation factor, a meta-information factor, and a thesaurus factor are taken into account when topic relevancy is computed. Web pages unrelated to the given topic are filtered out effectively, which greatly improves the precision of the search engine.
In addition, the anticipation factor is used to adjust the URL priority queue so that the topic crawler downloads pages with higher topic relevancy first.

(4) Lucene's source code was integrated into the application system through localization. Starting from the original vector space model, various factors related to web page popularity and topic relevancy were adjusted, and the frequency, category, and universality of keywords, as well as document length, were taken into account, yielding a scoring formula suited to the characteristics of the tobacco industry.
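
As an illustration of item (2), the following minimal Java sketch shows one common way to add random adjustments to PageRank. The abstract does not say which two adjustments the thesis uses, so this sketch assumes the standard pair: a page with no out-links (one usual reading of the "black hole" problem) is treated as linking to every page, and a damping factor lets the surfer jump to a random page at each step. The class name, method signature, damping value of 0.85, and the four-page graph are all hypothetical and are not taken from the thesis.

```java
import java.util.Arrays;

// Hypothetical sketch; not the thesis's implementation.
class PageRankSketch {

    /**
     * Power iteration with two assumed random adjustments:
     * (a) a page without out-links spreads its rank evenly over all pages;
     * (b) with probability (1 - damping) the surfer jumps to a random page.
     * outLinks[i] lists the pages that page i links to.
     */
    static double[] pageRank(int[][] outLinks, double damping,
                             double tolerance, int maxIterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);          // start from the uniform vector

        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = new double[n];
            double danglingMass = 0.0;       // rank held by pages with no out-links

            for (int page = 0; page < n; page++) {
                if (outLinks[page].length == 0) {
                    danglingMass += rank[page];
                } else {
                    double share = rank[page] / outLinks[page].length;
                    for (int target : outLinks[page]) {
                        next[target] += share;
                    }
                }
            }

            double delta = 0.0;              // L1 change between iterations
            for (int page = 0; page < n; page++) {
                next[page] = damping * (next[page] + danglingMass / n)
                        + (1.0 - damping) / n;
                delta += Math.abs(next[page] - rank[page]);
            }
            rank = next;
            if (delta < tolerance) {
                break;                       // converged
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical 4-page graph; page 3 has no out-links.
        int[][] outLinks = { {1, 2}, {2}, {0, 3}, {} };
        double[] rank = pageRank(outLinks, 0.85, 1e-10, 1000);
        System.out.println(Arrays.toString(rank));
    }
}
```

With both adjustments in place, the transition matrix is stochastic and every page can be reached from every other page, so the rank vector keeps a total of 1 and the iteration converges to a single solution regardless of the starting vector.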
Keywords/Search Tags: Vertical Search, PageRank Algorithm, Subject Discrimination, Document Arrangement