Design And Implementation Of The Distributed Clustering Search Engine Based On Mapreduce

Posted on: 2015-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Yu
GTID: 2308330473453364
Subject: Computer software and theory
Abstract/Summary:
In recent years, the rapid development of Internet technology has changed the structure of society: people increasingly rely on the network both to express their views and ideas and to obtain information, and the volume of information on the network has grown massively. Current centralized search engines, however, are inefficient when dealing with huge amounts of data. Worse, as network data keeps growing, search results are returned as long linear lists flooded with information that is irrelevant to the user, and existing search engines reach only part of the information on the network. As a result, users cannot locate the information they want within a short time. Enabling users to access the information they are looking for quickly, accurately, and comprehensively has therefore become an urgent need.

To address both the difficulty of quickly locating the desired information with general search engines and the inefficiency of centralized search engines on huge amounts of data, this thesis studies search engines, data mining, and distributed Hadoop clusters, and completes the following main tasks:

1. A clustering search engine based on a distributed cluster is designed and implemented. It covers information gathering, information preprocessing, retrieval of the user's query, clustering of the search results, and display of the results to the user.
2. To address the problem that information cannot be crawled comprehensively, the meta-search engine in this thesis, based on Nutch and the script interpreter engine Rhino, adopts a proposed crawling strategy that combines static and dynamic web crawling, so that information from both dynamic and static pages can be obtained in the information-gathering stage.
3. In the clustering module, the Canopy-Kmeans clustering algorithm and a Canopy-Kmeans algorithm improved with the max-min principle are implemented to run in MapReduce parallel mode rather than serially, and are applied to cluster the search results. Classic clustering algorithms such as LDA and Dirichlet are also implemented, so that a different clustering algorithm can be chosen for each type of repository to achieve relatively good results.
4. For cluster label generation, the thesis designs and implements a combination of automatically generated and custom labels, which makes the cluster labels readable and reasonable.
5. In the module that displays retrieval results to the user, a hierarchical directory structure shows the relationships among the clustered search results, enabling users to view the results more efficiently and more accurately.
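As an illustration of the Canopy-Kmeans idea named in the abstract, the following is a minimal single-process sketch in Python. It is not the thesis's MapReduce implementation and does not include the max-min improvement. The function names, the thresholds `t1` and `t2`, the use of Euclidean distance, and the sample points are all assumptions made for this example. A canopy pass first chooses rough groups cheaply, and the means of those groups are then used as the initial centers for ordinary K-means.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def canopy(points, t1, t2):
    """Canopy pass with a loose threshold t1 and a tight threshold t2 (t1 > t2).

    Returns a list of (center, members) pairs. Points within t2 of a center
    are removed from the candidate list; points between t2 and t1 stay
    available and may join later canopies as well.
    """
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [p for p in remaining if distance(p, center) <= t1]
        canopies.append((center, members))
        remaining = [p for p in remaining if distance(p, center) > t2]
    return canopies


def assign(points, centers):
    """Group each point under its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centers, max_iter=100):
    """Plain K-means, started from the supplied initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
canopies = canopy(points, t1=5.0, t2=3.0)
initial_centers = [mean(members) for _, members in canopies]
centers, clusters = kmeans(points, initial_centers)
```

In this sketch the canopy pass also fixes the number of clusters, so K does not have to be chosen by hand. In a MapReduce setting, the assignment step would typically run in mappers and the center update in reducers, but that split is not shown here.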
Keywords/Search Tags:Search engine, Clustering, Distributed, Hadoop, Clusters