Design And Implementation Of Large-scale Chinese Website Clustering Based On Hadoop

Posted on: 2017-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Gan
Full Text: PDF
GTID: 2308330488982874
Subject: Computer technology
Abstract/Summary:
Text clustering analysis is an important research field in data mining and has been widely used in statistics, financial analysis, biomedicine, information retrieval, and document classification. Popular applications include website navigation bars, paper similarity detection, and user recommendation.

With the rapid popularization of the Internet, the number of Chinese websites is growing tremendously, and the volume of data people obtain from the web keeps increasing. Different people have different needs and standards, which leads to diverse data and diverse quality requirements. How to extract the needed information from web pages quickly and efficiently has become a major challenge. Text clustering offers a good solution. However, because the data is massive and its features are diverse, traditional clustering analysis often cannot achieve satisfactory results in either time or space. With the rise of cloud computing, distributed parallel frameworks have been adopted for clustering, and they are being studied and applied by more and more scholars.

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It has two core components: HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides parallel computation over it. This thesis designs a clustering analysis system for Chinese web pages based on the Hadoop platform. The main research work is as follows.

1. Introduce the ideas behind some classical clustering algorithms and the related theory, and describe in detail the whole text clustering process and some common similarity measures.
2. Study in depth the two core frameworks and key technologies of the Hadoop platform, describe their relationship and operating mechanisms, and explain the advantages of running clustering experiments on Hadoop compared with a traditional stand-alone environment.
3. Build a distributed Hadoop environment, configure the Eclipse development tools, and write a program using the k-means clustering algorithm to test web page data, obtaining clustering results that successfully partition all the pages. The experimental results are organized and analyzed. They show Hadoop's powerful capability for processing large-scale data and that, within a certain range, computing power increases as the number of cluster nodes grows.
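The abstract does not give the thesis's code, so the following is only a minimal sketch of how one k-means iteration can be split into a map step (assign each page vector to its nearest centroid) and a reduce step (average the vectors in each cluster). It runs in memory in plain Python rather than on Hadoop. The function names, the use of cosine similarity, and the assumption that each page has already been segmented and converted to a numeric term-weight vector are illustrative choices, not details taken from the thesis.

```python
from collections import defaultdict
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_phase(vectors, centroids):
    """Map step: emit (index of most similar centroid, vector) per page."""
    for vec in vectors:
        best = max(range(len(centroids)),
                   key=lambda i: cosine_similarity(vec, centroids[i]))
        yield best, vec


def reduce_phase(pairs, centroids):
    """Reduce step: average the vectors assigned to each centroid."""
    groups = defaultdict(list)
    for idx, vec in pairs:
        groups[idx].append(vec)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        dim = len(members[0])
        new_centroids[idx] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return new_centroids


def kmeans(vectors, centroids, max_iter=20):
    """Repeat map and reduce until the centroids stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(vectors, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    assignments = [idx for idx, _ in map_phase(vectors, centroids)]
    return centroids, assignments
```

On a real cluster, the map step would run in parallel over blocks of page vectors stored in HDFS, and each iteration would be a separate MapReduce job that reads the centroids produced by the previous one.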
Keywords/Search Tags: Text clustering, Chinese segmentation, Distributed platform, Hadoop