
Optimization of Hadoop Based on Attribute Reduction under Covering Rough Set

Posted on: 2016-01-25    Degree: Master    Type: Thesis
Country: China    Candidate: X Wang    Full Text: PDF
GTID: 2428330473465659    Subject: Software engineering
Abstract/Summary:
As one of the most important branches of data mining, attribute reduction based on covering rough sets has been applied in fields such as computing, biology and chemistry to analyze data and extract models. However, traditional computation on a single server cannot cope with every application scenario in the era of information explosion. Cloud computing has therefore spread quickly in the market, because it can process large-scale data sets on clusters of ordinary computers and pool computational resources to make up for the limited capacity of a single server. Distributed computation on ordinary PCs, which replaces the single powerful server, is one of the most important branches of cloud computing. Compared with traditional equipment it is inexpensive, highly reliable and easy to extend, which makes it well suited to porting the algorithm. In practice, however, data partitioning and task scheduling introduce considerable overhead in both time and space. This thesis therefore designs an I/O structure and implements a data center that transfers intermediate results and reduces this extra overhead.

First, the traditional attribute reduction algorithm is ported to a distributed cluster, and the data-processing model and read/write method are adjusted accordingly. Next, the thesis verifies the time convergence of the algorithm on large-scale data sets and analyzes the problems that arise during operation, such as repeated start-up of the computing framework and frequent exchange of intermediate results. The thesis then presents the detailed design and implementation of a read/write data center based on flash memory, which serves as a flexible cache layer for the framework. To avoid the computation delay caused by moving intermediate results between memory and disk, all data can be routed through the data center, so that limited resources, both PCs and servers, are used to full advantage. The data center combines several mainstream techniques, including shared memory, multithreading, the singleton pattern and streaming, to provide a stable and efficient cache mechanism. In addition, the proposed framework redirects the communication and data-encapsulation strategies of the system classes to replace the original method, and modules in the data center encapsulate data-processing and merging methods so that results can be read back by the cluster.

After this optimization, the framework sustains the distributed attribute reduction algorithm efficiently and reduces running time. Relying on the cluster and the data center, it also scales reliably in both the number of parallel tasks and the data size on large-scale data sets. The final experiments validate the operating efficiency on data sets derived from the Wisconsin Breast Cancer Database. This section confirms the efficiency of the approach in both theory and practice by analyzing running time, chain index, global convergence and other measures, and it confirms the innovation and rigor of the work at the same time.
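To make the distribution step concrete, the following is a minimal sketch, not the thesis implementation, of how one step of an attribute reduction loop can be expressed as a Hadoop MapReduce job: the mapper keys each record by its values on a candidate attribute subset, and the reducer checks whether each resulting block is decision-consistent. The column indices, configuration keys and class names here are illustrative assumptions.

```java
// Minimal sketch of one attribute-reduction step on Hadoop MapReduce.
// The mapper groups records by their values on a candidate attribute subset;
// the reducer emits 1 for blocks whose decision labels all agree, 0 otherwise.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReductionStepJob {

    // Mapper: key = values of the candidate attribute subset, value = decision label.
    public static class SubsetMapper extends Mapper<Object, Text, Text, Text> {
        private int[] candidate;    // indices of the attribute subset under test (assumed)
        private int decisionIndex;  // index of the decision column (assumed)

        @Override
        protected void setup(Context context) {
            Configuration conf = context.getConfiguration();
            String[] idx = conf.get("reduction.candidate", "0,1").split(",");
            candidate = new int[idx.length];
            for (int i = 0; i < idx.length; i++) candidate[i] = Integer.parseInt(idx[i]);
            decisionIndex = conf.getInt("reduction.decision", 9);
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            StringBuilder sb = new StringBuilder();
            for (int i : candidate) sb.append(fields[i]).append('|');
            context.write(new Text(sb.toString()), new Text(fields[decisionIndex]));
        }
    }

    // Reducer: a block is consistent iff all of its decision labels agree.
    public static class ConsistencyReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String first = null;
            boolean consistent = true;
            for (Text v : values) {
                if (first == null) first = v.toString();
                else if (!first.equals(v.toString())) consistent = false;
            }
            context.write(key, new IntWritable(consistent ? 1 : 0));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("reduction.candidate", "0,1,2"); // attribute subset being evaluated (assumed layout)
        conf.setInt("reduction.decision", 9);     // decision column index (assumed layout)
        Job job = Job.getInstance(conf, "attribute-reduction-step");
        job.setJarByClass(ReductionStepJob.class);
        job.setMapperClass(SubsetMapper.class);
        job.setReducerClass(ConsistencyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Summing the consistent blocks on the driver side then gives a positive-region count for the candidate subset, which can drive the usual add-or-drop loop of the reduction.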
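The cache mechanism described in the abstract can likewise be illustrated by a minimal sketch, assuming a singleton, thread-safe in-memory store that keeps intermediate results between MapReduce rounds instead of spilling them to disk; the class and method names here are hypothetical and not taken from the thesis.

```java
// Hypothetical singleton cache for intermediate results shared by worker threads.
import java.util.concurrent.ConcurrentHashMap;

public final class IntermediateResultCache {
    private static final IntermediateResultCache INSTANCE = new IntermediateResultCache();
    private final ConcurrentHashMap<String, byte[]> store = new ConcurrentHashMap<>();

    private IntermediateResultCache() { }   // singleton: no external construction

    public static IntermediateResultCache getInstance() {
        return INSTANCE;
    }

    // Write an intermediate partition result; safe to call from many worker threads.
    public void put(String key, byte[] value) {
        store.put(key, value);
    }

    // Read a cached result back for the next iteration; null means "recompute or reload".
    public byte[] get(String key) {
        return store.get(key);
    }

    // Drop all cached results once a reduction round has completed.
    public void clear() {
        store.clear();
    }
}
```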
Keywords/Search Tags:Covering rough set, Attribute reduction, Distributed computation, Cache mechanism, Hadoop