Document-oriented Massive Data Mining Under Distributed Environment

Posted on:2014-01-11

Degree:Master

Type:Thesis

Country:China

Candidate:H L Chai

Full Text:PDF

GTID:2248330392961096

Subject:Software engineering

Abstract/Summary:

Data mining has long been a hot topic in computer science. With the rapid development of Web 2.0 services and cloud computing in recent years, the Internet has entered the big data era. Evident changes have taken place in the ways data is generated, transformed, stored, accessed and processed. Traditional data mining methods face tough challenges from big data, which is characterized by heterogeneity and explosive growth. This paper presents a novel approach to large-scale data mining in a distributed environment, covering data extraction, preprocessing, data warehousing and data mining.

Generally speaking, a complete data mining process consists of two phases, namely data warehousing and data mining, and deals with large volumes of data from multiple heterogeneous sources. The data warehouse is responsible for integrating and maintaining data in order to guarantee the consistency and efficiency of the system. The construction of a data warehouse is usually called the ETL process, which refers to the Extracting, Transforming and Loading of data. Traditional data warehouse design is based on an RDBMS, which calls for a unified schema, including table structures and foreign keys. A well-designed schema underpins the ACID properties of the RDBMS. However, in the big data era, the complexity, heterogeneity and explosive growth of data do not fit well with a fixed schema; they demand scalability, flexibility and efficiency instead. These are the bottlenecks of an RDBMS.

Data mining is carried out on the basis of a data warehouse. There are many mature data mining algorithms, such as classification, clustering, association, prediction and so on. Other well-known techniques, for example machine learning and neural networks, are also applied to data mining problems. All these methods share common features: rare write and update operations, frequent reads and intensive computation. The mechanism by which an RDBMS guarantees ACID properties becomes a constraint in this circumstance.

This paper proposes a document-oriented data mining approach for a distributed environment. The ETL process is carried out through MapReduce to construct a document-based data warehouse. Afterwards, a MongoDB + Lucene + MapReduce solution, rather than grammatical analysis, is introduced to accomplish the data mining process. This idea is inspired by Web search engines. Finally, the whole approach is validated by solving a followee recommendation problem in a microblog service as a real case study.
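
The pipeline outlined in the abstract can be illustrated with a minimal, single-process sketch. It is not the thesis's implementation: plain Python dictionaries stand in for MongoDB documents, an in-memory inverted index stands in for Lucene, and the record layout, the field names (`source`, `tags`, `followee`) and the tag-overlap scoring are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical raw records from two heterogeneous sources (assumed layout).
RAW = [
    {"source": "profile", "user": "alice", "tags": "Data Mining;NoSQL"},
    {"source": "follow", "user": "alice", "followee": "bob"},
    {"source": "profile", "user": "bob", "tags": "nosql;mapreduce"},
    {"source": "profile", "user": "carol", "tags": "data mining;lucene"},
    {"source": "profile", "user": "dave", "tags": "NoSQL; data mining"},
]


def map_record(rec):
    """Map step: normalise one raw record into (user, partial document)."""
    if rec["source"] == "profile":
        tags = [t.strip().lower() for t in rec["tags"].split(";") if t.strip()]
        yield rec["user"], {"tags": tags}
    elif rec["source"] == "follow":
        yield rec["user"], {"followees": [rec["followee"]]}


def reduce_user(user, partials):
    """Reduce step: merge all partial documents of one user into one document."""
    doc = {"_id": user, "tags": [], "followees": []}
    for partial in partials:
        for key, values in partial.items():
            doc[key].extend(values)
    doc["tags"] = sorted(set(doc["tags"]))
    doc["followees"] = sorted(set(doc["followees"]))
    return doc


def run_etl(records):
    """Run map, shuffle (sort by key) and reduce in a single process."""
    pairs = [pair for rec in records for pair in map_record(rec)]
    pairs.sort(key=itemgetter(0))
    return [
        reduce_user(user, [partial for _, partial in group])
        for user, group in groupby(pairs, key=itemgetter(0))
    ]


def build_index(docs):
    """Build an inverted index from tag to user ids, as a search engine would."""
    index = defaultdict(set)
    for doc in docs:
        for tag in doc["tags"]:
            index[tag].add(doc["_id"])
    return index


def recommend(user_doc, index):
    """Rank users not yet followed by the number of tags shared with user_doc."""
    scores = defaultdict(int)
    for tag in user_doc["tags"]:
        for other in index.get(tag, ()):
            if other != user_doc["_id"] and other not in user_doc["followees"]:
                scores[other] += 1
    return sorted(scores, key=lambda u: (-scores[u], u))
```

Calling `run_etl(RAW)` yields one schema-free document per user, and `recommend` then answers a followee query by index lookup alone, which reflects the read-heavy, rarely updated workload the abstract describes.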

Keywords/Search Tags:

Big Data, Data Mining, Data Warehouse, NoSQL

Related items

1	Document-oriented Massive Data Mining Under Distributed Environment
2	Application Research Of Data Analysis Technology Based On NoSQL
3	Application Of Data Warehouse And Data Mining Technology In Tax Administration System
4	The Design And Implementation For Customer Data Analyzing System Based On Data Warehouse And Data Mining Technology
5	Data Warehouse And Data Mining In The Securities Brokerage Business Crm Applications
6	Data Warehouse And Data Mining Technology Theory And Applications
7	The Analysis In Coal Mine Historical Data Based On Data Warehouse
8	Research Of SQL Server 2000 Data Warehouse And Data Mining Technology In The Military Informational Management Application
9	Research On The Application Of Data Warehouse And Data Mining In The Analysis Of Students' Academic Achievements
10	Based On The Data Warehouse, Data Mining Technology Research And Application In The Real Estate Information Analysis System