Document-oriented Massive Data Mining Under Distributed Environment

Posted on: 2014-01-11; Degree: Master; Type: Thesis
Country: China; Candidate: H L Chai
GTID: 2248330392961096; Subject: Software engineering
Abstract/Summary:
Data mining has long been a hot topic in computer science. With the rapid development of Web 2.0 services and cloud computing in recent years, the internet has entered the big data era. Evident changes have taken place in the ways data is generated, transformed, stored, accessed and processed. Traditional data mining methods face tough challenges from big data, which is characterized by heterogeneity and explosive growth. This paper presents a novel approach to large-scale data mining in a distributed environment, covering data extraction, preprocessing, data warehousing and data mining.

Generally speaking, a complete data mining process consists of two phases, namely data warehousing and data mining, and deals with large volumes of data from multiple heterogeneous sources. The data warehouse is responsible for integrating and maintaining data in order to guarantee the consistency and efficiency of the system. The construction of a data warehouse is usually called the ETL process, which refers to the Extracting, Transforming and Loading of data. Traditional data warehouse design is based on an RDBMS, which calls for a unified schema, including table structures and foreign keys. A well-designed schema guarantees the ACID properties of the RDBMS. However, in the big data era, the complexity, heterogeneity and explosive growth of data do not fit well with a fixed schema; they instead require scalability, flexibility and efficiency. These are the bottlenecks of an RDBMS.

Data mining is carried out on the basis of a data warehouse. There are many mature data mining algorithms, such as classification, clustering, association, prediction and so on. Other well-known techniques are also applied to data mining problems, for example machine learning and neural networks. All of these methods share common features: rare write and update operations, frequent reads and intensive computation. The mechanism that guarantees the ACID properties in an RDBMS therefore becomes a constraint in this setting.

This paper proposes a document-oriented data mining approach for a distributed environment. The ETL process is carried out through MapReduce to construct a document-based data warehouse. Afterwards, a MongoDB + Lucene + MapReduce solution, rather than grammatical analysis, is introduced to accomplish the data mining process. This idea is inspired by web search engines. Finally, the whole approach is validated by solving a followee recommendation problem in a microblog service as a real case study.
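
The abstract gives no implementation details, so the following is only a minimal single-process sketch of the general idea of a MapReduce-style ETL step that merges records from heterogeneous sources into schema-free documents. The sample data, the field names (`uid`, `name`, `followees`) and all function names are hypothetical and are not taken from the thesis; a real deployment would run the map and reduce steps on a distributed framework and load the resulting documents into a document store such as MongoDB.

```python
from collections import defaultdict

# Hypothetical raw records from two heterogeneous sources (illustrative data only).
PROFILE_ROWS = [
    {"uid": "u1", "name": "alice"},
    {"uid": "u2", "name": "bob"},
]
FOLLOW_EDGES = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]


def map_profile(row):
    """Map step for profile rows: emit (user id, partial document)."""
    yield row["uid"], {"name": row["name"]}


def map_follow(edge):
    """Map step for follow edges: emit (follower id, partial document)."""
    follower, followee = edge
    yield follower, {"followees": [followee]}


def reduce_user(uid, partials):
    """Reduce step: merge all partial documents for one user into one document."""
    doc = {"_id": uid, "name": None, "followees": []}
    for part in partials:
        if "name" in part:
            doc["name"] = part["name"]
        doc["followees"].extend(part.get("followees", []))
    doc["followees"].sort()
    return doc


def run_etl(profile_rows, follow_edges):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for row in profile_rows:
        for key, value in map_profile(row):
            groups[key].append(value)
    for edge in follow_edges:
        for key, value in map_follow(edge):
            groups[key].append(value)
    return [reduce_user(uid, groups[uid]) for uid in sorted(groups)]


if __name__ == "__main__":
    for document in run_etl(PROFILE_ROWS, FOLLOW_EDGES):
        print(document)
```

Each output document carries everything known about one user, so a later read-heavy mining step can fetch it without joins; this is the property the abstract contrasts with a schema-bound RDBMS design.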
Keywords/Search Tags: Big Data, Data Mining, Data Warehouse, NoSQL