
Research and Implementation of a Distributed High-Order Pure Dependence Algorithm

Posted on: 2015-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: X M Ruan
GTID: 2298330452459575
Subject: Computer Science and Technology
Abstract/Summary:
In the cloud era, many theories and techniques of classical information retrieval (IR) regularly run into limitations caused by big data sets in many fields. Recently, many data mining methods have been proposed to mine useful word associations from text documents. However, efficiently discovering high-order patterns remains a great challenge, especially in rapidly expanding data collections. The aim here is to find high-order pure word associations in big data sets. By “pure” we mean that the words form an inseparable semantic entity, i.e., a high-order dependent word association that cannot be reduced to the random coincidence of low-order dependent word associations.

In addition, to handle the massive computation caused by the information explosion, we draw on GFS and HDFS, proposed by Google and the Apache Software Foundation respectively, to store massive data; the MapReduce distributed computing model to process big data; and BigTable and HBase to query and update massive data in real time. Because Hadoop offers high reliability, high scalability, high efficiency, and high fault tolerance, and is free and open source, the Hadoop distributed processing architecture can be used to mine high-order pure dependence word relationships efficiently.

This paper proposes distributed pure dependence (PD) mining algorithms based on Information Geometry that can efficiently mine pure dependence in a distributed environment: the Distributed Pairwise Pure Dependence Mining algorithm and the Distributed Theta Pure Dependence Mining algorithm, which we collectively call the Distributed Pure Dependence Mining (DPDM) algorithm. Because MapReduce is designed for large tasks that involve a great deal of data and computation, DPDM uses MapReduce to make full use of the resources of the tasktrackers in the distributed cluster. DPDM uses MapReduce to perform the Log Likelihood Ratio Test (LLRT), distributing the computing tasks evenly across all tasktrackers, and combines this with multi-threaded programming to run computations in parallel and make full use of the multiple cores of each machine. To give DPDM real-time access to data in the distributed file system, the algorithm uses HBase to store the statistical information of all terms. In addition, we construct an integrated distributed pure dependence mining framework that includes a content extractor, stop-word filter, word stemmer, index builder, pattern miner, etc. The DPDM framework can automatically mine pure dependence patterns from big data.

Extensive study in this paper shows that the DPDM algorithm significantly speeds up the mining of high-order patterns in large data sets. We then apply the extracted high-order patterns to tasks such as text classification and achieve significant improvement. However, the Distributed Pairwise Pure Dependence Mining algorithm and the Distributed Theta Pure Dependence Mining algorithm still have deficiencies in their data structures and algorithms and need to be improved. In addition, we will study how to effectively apply PD word relationships to classic IR tasks, image retrieval, sound retrieval, video retrieval, and so on.
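The abstract does not give the details of the LLRT step, so the following is only a minimal sketch of the general idea, under stated assumptions: a map-style pass emits document-level counts for words and word pairs, a reduce-style pass sums them, and a standard G² log-likelihood ratio statistic is computed on the resulting 2x2 contingency table. It is plain single-process Python, not Hadoop code, and it is not the Information Geometry-based pure dependence test of the thesis. The names `count_map`, `count_reduce`, `llr`, and `pair_llr` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def count_map(doc_tokens):
    """Map step (illustrative): emit (key, 1) for each distinct word
    and each distinct unordered word pair in one document."""
    words = sorted(set(doc_tokens))
    for w in words:
        yield (w,), 1
    for pair in combinations(words, 2):
        yield pair, 1


def count_reduce(emitted):
    """Reduce step (illustrative): sum the emitted values per key."""
    counts = Counter()
    for key, value in emitted:
        counts[key] += value
    return counts


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table.
    Cells with an observed count of zero contribute nothing to the sum."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    g = 0.0
    for obs, r, c in cells:
        if obs > 0:
            g += obs * math.log(obs * n / (r * c))
    return 2.0 * g


def pair_llr(counts, n_docs, a, b):
    """Build the 2x2 document-level table for words a and b and score it."""
    a, b = sorted((a, b))
    k11 = counts[(a, b)]          # documents containing both words
    k12 = counts[(a,)] - k11      # documents containing a but not b
    k21 = counts[(b,)] - k11      # documents containing b but not a
    k22 = n_docs - k11 - k12 - k21  # documents containing neither
    return llr(k11, k12, k21, k22)


if __name__ == "__main__":
    docs = [["cloud", "data"], ["cloud", "data"], ["index"], ["index"]]
    emitted = (kv for doc in docs for kv in count_map(doc))
    counts = count_reduce(emitted)
    print(pair_llr(counts, len(docs), "cloud", "data"))
```

In a real cluster the map and reduce functions would run as separate tasks and the summed counts would sit in a store such as HBase; here a `Counter` stands in for both. A larger G² value indicates stronger evidence against independence of the two words, and with one degree of freedom the conventional 5% critical value is about 3.84.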
Keywords/Search Tags: big data, pure dependence, distributed, word relationship, DPDM algorithm