Design and Implementation of a Clustering Algorithm for Large-Scale Chinese Short Text Based on MapReduce

Posted on: 2015-03-23  Degree: Master  Type: Thesis
Country: China  Candidate: Y F Yang  Full Text: PDF
GTID: 2268330428982557  Subject: Computer technology
Abstract/Summary:
Text clustering is an important research direction in data mining and information retrieval. With the popularity of the Internet, the volume of data on the network is growing rapidly, and most of it is stored as text. How to mine the massive amount of text on web pages has become a major challenge for computer science. Text clustering provides an effective way to organize massive textual information by category. As an unsupervised machine learning method, text clustering can be carried out automatically by a computer: it discovers the intrinsic characteristics and distribution of texts by comparing their similarity. It can not only organize web text effectively, but can also produce classification templates that guide the classification of web text, making the text easier to retrieve and read. In recent years, text clustering has been widely applied in information retrieval, automatic text summarization, and other Internet areas. With the rise of cloud computing, more distributed parallel computing frameworks have become available, and more and more researchers are turning their attention to distributed text mining. A core problem of the big data era is data mining, which filters data much as smelting extracts precious metals. Processing and analyzing these data with cloud computing to extract useful information has become an important research direction in data mining. Hadoop is open source software from Apache that provides a distributed file system and the MapReduce computing framework. It supplies the infrastructure of a cloud computing software platform and integrates a set of components such as databases and data warehouses. Hadoop has become a standard platform in academia and industry for cloud computing research and applications.
This thesis focuses on the Hadoop software framework, including the core architectures and operating mechanisms of HDFS, MapReduce, HBase, and other components. It then analyzes the shortcomings of the framework, such as the single points of failure in HDFS and MapReduce, gives corresponding solutions, and on this basis builds a highly reliable and secure Hadoop environment. Drawing on the characteristics of traditional classification and clustering algorithms, it presents the design of a cloud-based data mining system and describes the functions of each layer of the system in detail, especially the classification and clustering modules. The main work is as follows: (1) a Hadoop distributed platform suitable for text clustering is built, and the system is tuned at both the Hadoop and the Linux level; (2) based on the characteristics of short text, a clustering method suited to Chinese short text is designed using the vector space model, the TF-IDF formula, and the cosine formula; (3) the system is developed with the Eclipse development tools and Java, started with shell scripts, integrated, and tested on 1,700,000 experimental data items; the experimental results are analyzed and an improvement scheme is put forward.
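Point (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the thesis reports a Java system running on Hadoop, while the Python below runs on a single machine and only shows how the vector space model, TF-IDF weighting, and cosine similarity fit together. It assumes the Chinese short texts have already been segmented into tokens, and it uses one common TF-IDF variant (relative term frequency times log(N/df)) that the abstract does not specify. The function names and the three sample texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors ({term: weight})."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed variant: relative term frequency * log(N / df).
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, pre-segmented short texts (not data from the thesis).
docs = [
    ["手机", "屏幕", "很", "清晰"],
    ["手机", "屏幕", "分辨率", "高"],
    ["今天", "天气", "很", "好"],
]
vectors = tf_idf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

Here the first two texts share two terms and the first and third share only one, so `sim_related` comes out larger than `sim_unrelated`. A clustering step would group texts by such similarities; in a MapReduce setting the document-frequency count and the pairwise or centroid comparisons are the parts that would typically be distributed, though the abstract does not say how the thesis partitions the work.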
Keywords/Search Tags: Text clustering, Hadoop, MapReduce, Parallel Algorithm, Data Mining