
Design And Implementation Of Focused Search Engine In Hadoop Platform

Posted on: 2019-04-15 | Degree: Master | Type: Thesis
Country: China | Candidate: M F Wei | Full Text: PDF
GTID: 2428330602452253 | Subject: Information Science
Abstract/Summary:
With the rapid growth in the number of Internet users, both the volume of data on the network and the variety of data formats have increased rapidly. In the context of big data, search engines have become one of the main means by which users obtain the information they need. For users with different professional backgrounds, however, the diversity of network information leaves general-purpose search engines unable to meet their needs, so search engine development has to become user-centered; topic (focused) search engines emerged in this context. The growth in data volume also increases the number of data streams a search engine must process, and distributed computing is better suited to massive data storage and highly concurrent computation. This thesis studies the technologies related to topic search engines on the Hadoop platform. Precision is a key indicator of search engine performance, and improving it is the focus of this study. Retrieval response time and human-computer interaction are the two indicators through which a search engine affects user experience, and crawler crawl speed directly reflects the performance of the search engine back end. To improve these indicators, the main work of this thesis is as follows.

First, the thesis analyzes the technologies and theories related to distributed computing and topic search engines, including distributed programming concepts, topic filtering methods for web crawlers, Chinese word segmentation algorithms, classical ranking algorithms, and commonly used clustering algorithms.

Second, it addresses the problems of the current PageRank algorithm by optimizing it with respect to topic links, the number of internal and external links, and related factors, so that the algorithm is more topic-oriented and better expresses the topic relevance of a web page. The improved PageRank algorithm is then adapted to MapReduce to meet the requirements of distributed computing. Building on the original search result ranking algorithm, the TF-IDF algorithm, the OPIC algorithm, and the optimized PageRank algorithm are used together to optimize the ranking of results.

Third, the suffix tree clustering algorithm is used to cluster a user's search results in real time. The author combines real-time clustering with a topic search engine for the first time, improving the human-computer interaction interface. The visualized clusters give users a more intuitive overall view of the search results and make it easier to browse to specific information, which improves the user experience.

Finally, a complete topic search engine is built, comprising a topic crawler, a Chinese word segmentation module, an index module, and a retrieval module. On this system, the flexible scalability of the distributed system is verified, the crawl speed of the topic crawler is improved, and the retrieval response time is reduced. In addition, the optimized ranking algorithm improves the precision of the system, and the suffix tree algorithm provides real-time clustering and cluster visualization. The thesis concludes by summarizing the results and shortcomings of the study and outlining directions for future work.
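
As an illustration of what adapting PageRank to MapReduce can look like, the following is a minimal single-process sketch in Python. It implements standard PageRank only; the topic-link and internal/external-link optimizations described in the abstract are not reproduced, because their formulas are not given here. The function names (`map_phase`, `reduce_phase`, `pagerank`), the damping factor of 0.85, and the in-memory dictionary standing in for the shuffle step are all assumptions made for this example, not the thesis's implementation.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional damping factor; an assumption, not taken from the thesis


def map_phase(page, rank, out_links):
    """Map step: pass the link list through and emit a rank share to each target."""
    yield page, ("links", out_links)
    if out_links:
        share = rank / len(out_links)
        for target in out_links:
            yield target, ("rank", share)


def reduce_phase(page, values, n_pages):
    """Reduce step: sum the shares received by one page and apply damping."""
    out_links = []
    total = 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            total += payload
    new_rank = (1 - DAMPING) / n_pages + DAMPING * total
    return page, new_rank, out_links


def pagerank(graph, iterations=20):
    """Run a fixed number of map/reduce rounds over an adjacency-list graph.

    Assumes every link target is itself a key of `graph` and that no page
    has an empty link list (dangling pages would leak rank mass here).
    """
    n_pages = len(graph)
    ranks = {page: 1.0 / n_pages for page in graph}
    for _ in range(iterations):
        grouped = defaultdict(list)  # stands in for the MapReduce shuffle
        for page, links in graph.items():
            for key, value in map_phase(page, ranks[page], links):
                grouped[key].append(value)
        ranks = {}
        for page, values in grouped.items():
            key, new_rank, _ = reduce_phase(page, values, n_pages)
            ranks[key] = new_rank
    return ranks


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```

In a real Hadoop job, each iteration would be a separate MapReduce pass with the link structure carried along in the emitted records, which is why the map step re-emits each page's link list.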
Keywords/Search Tags:PageRank, Topic Distillation, Suffix Tree Clustering, Distributed Computing, Search Engine