
Design And Implementation Of The Vertical Search Engines With User Interest Model

Posted on: 2018-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: M X Yang
GTID: 2348330518494398
Subject: Electronics and Communications Engineering
Abstract/Summary:
In recent years the influence of the "Internet Age" has deepened, and the network has been flooded with information of every kind. This also brings the problem of information overload: users cannot quickly find the information they need, so the availability of information falls, much useful information is not found in time, and resources are wasted. This thesis presents the design and implementation of a vertical search engine that incorporates a user interest model. The concrete work is as follows.

First, the thesis clarifies the key problems the system is expected to solve. It gives a brief workflow of the search engine and introduces the key technologies involved in its development, focusing in particular on the problem of URL deduplication.

Second, the thesis describes in detail the analysis and modeling of the user interest model, then describes how user data is collected and how user behavior is classified in a Python environment. On this basis, the author proposes a method for quantifying interest based on hybrid behaviors, which gives special weight to page browsing time and falls back on other behaviors when the page browsing time is abnormal.

Third, the thesis introduces the architecture of the system, which consists of a web crawling module, an indexing and retrieval module, and a page display module. The system is developed in Python using Scrapy, BeautifulSoup, Whoosh and Flask. During development, the author found that the original URL deduplication method of the Scrapy framework can lead to serious memory consumption, and proposes using a Bloom filter as an improvement. Drawing on practical experience, the author also developed two strategies to keep the crawler's URL requests from being blocked.
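
The abstract does not give the thesis's Bloom filter code, so the following is only a minimal sketch of the general technique, written with the Python standard library. The class name `BloomFilter`, the helper `seen_before`, the sizing parameters and the double-hashing scheme are all assumptions made for illustration, and the integration with Scrapy is not shown. The idea is that a fixed-size bit array replaces an ever-growing in-memory set of seen URLs, at the cost of a small false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter for URL deduplication (not the thesis's code)."""

    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))


def seen_before(bloom: BloomFilter, url: str) -> bool:
    """Return True if the URL was probably crawled already; otherwise record it."""
    if url in bloom:
        return True
    bloom.add(url)
    return False
```

A crawler using this sketch would call `seen_before` on each candidate URL and skip those for which it returns True. Because a Bloom filter can report false positives, a small fraction of new URLs may be skipped; it never reports a URL it has stored as unseen.
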
To improve Whoosh's Chinese word segmentation, the thesis proposes using the open-source jieba word segmentation component.

Finally, the system was tested over 32 days and evaluated on four measures: recall, precision, response time and dead-link ratio. Conclusions were drawn from the collected user evaluations and feedback.
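
The abstract names the evaluation measures but not how they were computed. As a hedged illustration, the sketch below uses the standard definitions of precision and recall, and assumes that the dead-link ratio is the share of returned links whose HTTP status code is 400 or above. The function names and that status-code rule are assumptions, not details taken from the thesis.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Standard precision and recall for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # precision: fraction of returned results that are relevant
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # recall: fraction of relevant documents that were returned
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def dead_link_ratio(status_codes: Iterable[int]) -> float:
    """Share of result links considered dead (assumed here: HTTP status >= 400)."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    return sum(1 for code in codes if code >= 400) / len(codes)
```

For example, if a query returns four pages of which two are among three relevant pages, precision is 2/4 and recall is 2/3.
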
Keywords/Search Tags: User Interest Model, Vertical Search Engine, User Behavior, URL Deduplication