Design And Implementation Of Large-scale Chinese Website Clustering Based On Hadoop

Posted on:2017-03-03

Degree:Master

Type:Thesis

Country:China

Candidate:Z Z Gan

Full Text:PDF

GTID:2308330488982874

Subject:Computer technology

Abstract/Summary:

Text clustering analysis is an important research field in data mining and has been widely used in statistics, financial analysis, biomedicine, information retrieval, and document classification. Popular applications also include website navigation bars, paper similarity detection, and user recommendation.

With the rapid popularization of the Internet, the number of Chinese websites is growing tremendously, and the amount of data that people obtain from the web keeps increasing. Different people have different needs and standards, which leads to diverse data and diverse quality requirements. Extracting the needed information from web pages quickly and efficiently has become a major challenge at this stage, and the study of text clustering offers a good solution. Because the data is massive and its features are diverse, traditional clustering analysis often fails to achieve satisfactory results in either time or space. With the rise of cloud computing, distributed parallel frameworks have been adopted in the clustering process and are being studied and applied by more and more scholars.

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It has two core components: HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides parallel computation over that data. This paper designs a system for clustering analysis of Chinese web pages on the Hadoop platform. The main research work of this paper is as follows.

1. Introduce the common ideas of classical clustering algorithms and the related theoretical knowledge. Describe in detail the whole process of text clustering and some common similarity measures.
2. Study in depth the two core frameworks and the key technologies of the Hadoop platform, describe their relationship and operating mechanism, and explain the advantages of running clustering experiments on Hadoop compared with a traditional stand-alone environment.
3. Build a Hadoop distributed environment, configure the Eclipse development tools, and write a program based on the k-means clustering algorithm to test web page data. The program produces clustering results that successfully partition all the pages. After the experimental results were organized and analyzed, they showed the powerful computing capability of Hadoop for large-scale data and showed that, within a certain range, computing power increases as cluster nodes are added.
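The abstract does not give the implementation itself, so the following is only a minimal sketch of how one k-means iteration can be split into a map step and a reduce step. It runs in a single Python process rather than on Hadoop, and all function names (nearest, map_phase, reduce_phase, kmeans) are hypothetical. It also assumes each web page has already been segmented and converted into a numeric feature vector, and it uses Euclidean distance as one possible similarity measure.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map step: emit (cluster_index, point) for every input vector."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_phase(pairs, centroids):
    """Reduce step: average the points of each cluster to get new centroids."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # Assumption: a cluster that receives no points keeps its old centroid.
    updated = list(centroids)
    for idx, pts in groups.items():
        updated[idx] = tuple(sum(coords) / len(pts) for coords in zip(*pts))
    return updated


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map and reduce until the centroids stop moving or max_iter is hit."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

In a real MapReduce job, each iteration would typically be submitted as a separate job, with the current centroids shared with every mapper and the reducer output used as the centroids for the next round. A call such as kmeans(vectors, initial_centroids) returns the final list of centroids.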

Keywords/Search Tags:

Text clustering, Chinese segmentation, Distributed platform, Hadoop

Related items

1	Design And Implementation Of High Performance Text Clustering Algorithm Based On Hadoop
2	Research On News Recommendation Algorithm Based On LDA In Hadoop Platform
3	Research On Parallelization Of Text Clustering Based On Hadoop
4	Research And Application Of Text Mining Based On Hadoop
5	Optimization Of SOM Algorithm And Application In Chinese Text Clustering
6	Research And Implementation Of Distributed Clustering Algorithm Based On Hadoop Platform
7	Distributed EM Clustering Algorithm Based On Hadoop Platform
8	Research On Clustering Algorithm Based On Distributed Platform
9	Research On Chinese Text Feature Classification Based On Distributed Framework
10	Design And Implementation Of Bank Customer Marketing Service Management Platform Based On Hadoop Distributed Architecture