Design And Implementation Of Focused Search Engine In Hadoop Platform

Posted on:2019-04-15

Degree:Master

Type:Thesis

Country:China

Candidate:M F Wei

GTID:2428330602452253

Subject:Information Science

Abstract/Summary:

With the rapid growth in the number of Internet users, both the volume of data on the network and the variety of data formats have increased rapidly. In this big-data context, search engines have become one of the main means by which users obtain the information they need. For users with different professional backgrounds, however, the diversity of network information leaves general-purpose search engines unable to meet their needs, so search engine development has to become user-centered; topic search engines emerged in this context. Growing data volume also increases the number of data streams a search engine must process, and distributed computing technology is better suited to massive data storage and highly concurrent computation. This thesis studies the technologies related to topic search engines on the Hadoop platform. Precision is a key indicator of search engine performance, and improving it is the focus of this study. Retrieval response time and human-computer interaction are the two indicators through which a search engine affects user experience, and crawler speed directly reflects the performance of the search engine back end. To improve these indicators, the main work of this thesis is as follows.

First, the thesis analyzes the technologies and theories related to distributed computing and topic search engines, including distributed programming concepts, topic filtering methods for web crawlers, various Chinese word segmentation algorithms, classical ranking algorithms, and commonly used clustering algorithms. Second, it addresses problems in the current PageRank algorithm, optimizing it with respect to topic links, the number of internal and external links, and other factors, so that it is more topic-oriented and better expresses the topic relevance of a web page. The improved PageRank algorithm is then adapted to MapReduce to meet the requirements of distributed computing. Building on the original search result ranking algorithm, the TF-IDF algorithm, the OPIC algorithm, and the optimized PageRank algorithm are used to optimize the ranking of results. Third, a suffix tree clustering algorithm is used to cluster the user's search results in real time. The author combines real-time clustering with a topic search engine for the first time, improving the human-computer interaction interface. The visualized clusters give users a more intuitive overall view of the search results and make it easier to browse specific information, which improves the user experience.

Finally, a complete topic search engine is built, comprising a topic crawler, a Chinese word segmentation module, an index module, and a retrieval module. On this basis, the flexible scalability of the distributed system is verified, the crawl speed of the topic crawler is improved, and the retrieval response time is reduced; the precision of the system is improved by the optimized ranking algorithm, and real-time clustering and cluster visualization are realized with the suffix tree algorithm. The thesis concludes by summarizing the results and shortcomings of this study and outlining directions for future work.
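
As a rough illustration of what adapting PageRank to MapReduce involves, the sketch below expresses one iteration of the standard (unmodified) PageRank as a map step and a reduce step in plain Python. It is not the author's improved algorithm: the abstract does not give the topic-link or internal/external-link weighting, so none is modeled here. The function names, the damping factor of 0.85, and the toy graph are assumptions made for illustration, and the sketch assumes every page has at least one out-link and that every link target is itself a key of the graph.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Map step: each page splits its current rank evenly over its out-links."""
    for page, links in graph.items():
        # Emit a zero contribution so pages without in-links still reach the reducer.
        yield page, 0.0
        for target in links:
            yield target, ranks[page] / len(links)


def reduce_phase(pairs, damping, n_pages):
    """Reduce step: sum the contributions per page, then apply the damping factor."""
    sums = defaultdict(float)
    for page, contribution in pairs:
        sums[page] += contribution
    return {
        page: (1.0 - damping) / n_pages + damping * total
        for page, total in sums.items()
    }


def pagerank(graph, damping=0.85, iterations=20):
    """Run a fixed number of map/reduce rounds starting from a uniform rank vector."""
    n_pages = len(graph)
    ranks = {page: 1.0 / n_pages for page in graph}
    for _ in range(iterations):
        ranks = reduce_phase(map_phase(graph, ranks), damping, n_pages)
    return ranks


if __name__ == "__main__":
    # Hypothetical three-page link graph, used only for demonstration.
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```

On a real Hadoop cluster the two phases would run as a mapper and a reducer over link records stored in HDFS, with one job per iteration; the in-memory dictionaries here only stand in for that data flow.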

Keywords/Search Tags:

PageRank, Topic Distillation, Suffix Tree Clustering, Distributed Computing, Search Engine

Related items

1	Design And Implementation Of Meta Search Engine Based On Suffix Tree Clustering
2	Topic Search Engine Key Technology Research
3	Research On Vietnamese News Topic Recognition Method Based On Suffix Tree Clustering Algorithm
4	The Application Of Suffix Tree Clustering Algorithm In Meta Search Engine
5	Search Engine Optimization Method Based On Pagerank
6	Research Of Chinese Meta Search Engine Based On Clustering
7	The Meta Search Engine Research On Topic Distillation Algorithms
8	The Research And Design Of Search Engine Based On Distribution
9	Text Clustering And Its Application In Web Community Search Engine
10	Research And Implementation Of Meta-search Engine Based On Specialized Search Engine