Research And Implementation Of Distributed High Order Pure Dependence Algorithm

Posted on:2015-11-14

Degree:Master

Type:Thesis

Country:China

Candidate:X M Ruan

Full Text:PDF

GTID:2298330452459575

Subject:Computer Science and Technology

Abstract/Summary:

Many theories and techniques of classical information retrieval (IR) regularly run into limitations caused by the large data sets that now arise in many fields in the cloud era. Recently, many data mining methods have been proposed to mine useful word associations from text documents. However, efficiently discovering high-order patterns remains a great challenge, especially in rapidly expanding data collections. The aim here is to find high-order pure word associations in big data sets. By "pure" we mean that the words form an inseparable semantic entity, i.e. a high-order dependent word association that cannot be reduced to the random coincidence of low-order dependent word associations.

In addition, to handle the massive computation caused by the information explosion, we make use of GFS and HDFS, proposed by Google and the Apache Software Foundation respectively, to store massive data; the MapReduce distributed computing model to process big data problems; and BigTable and HBase to query and update massive data in real time. Because Hadoop offers high reliability, high scalability, high efficiency, and high fault tolerance, and is free and open source, the Hadoop distributed processing architecture can be used to mine high-order pure dependence word relationships efficiently.

This paper proposes distributed pure dependence (PD) mining algorithms based on Information Geometry that can efficiently mine pure dependence in a distributed environment: the Distributed Pairwise Pure Dependence Mining algorithm and the Distributed Theta Pure Dependence Mining algorithm, which we collectively call the Distributed Pure Dependence Mining (DPDM) algorithm in this paper. Because MapReduce is suited to huge tasks involving large amounts of data and computation, DPDM uses MapReduce to make full use of the resources of the TaskTrackers in the distributed cluster.
In this paper, DPDM uses MapReduce to perform the Log Likelihood Ratio Test (LLRT) by distributing the computing tasks evenly across all TaskTrackers, and combines this with multi-threaded programming to run parallel computations that make full use of the multiple cores of each machine. To give DPDM real-time access to data in the distributed file system, the algorithm uses HBase to store the statistical information of all terms. In addition, we construct an integrated distributed pure dependence mining framework, including a content extractor, stop-word filter, word stemmer, index builder, pattern miner, etc. The DPDM framework can automatically mine pure dependence patterns from big data.

Extensive study in this paper shows that the DPDM algorithm significantly speeds up the mining of high-order patterns in large data sets. We then apply the extracted high-order patterns to tasks such as text classification and achieve significant improvement.

However, the Distributed Pairwise Pure Dependence Mining algorithm and the Distributed Theta Pure Dependence Mining algorithm still have deficiencies in data structure and algorithm design and need to be improved. In addition, we will study how to effectively apply the PD word relationship to classic IR tasks, image retrieval, sound retrieval, video retrieval, and so on.
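
The abstract does not give the details of the LLRT step, so the following is only a minimal single-machine sketch of the general idea for the pairwise case: a map step emits document-level counts for words and word pairs, a reduce step sums them, and a standard log-likelihood ratio (G^2) statistic is computed from the resulting 2x2 contingency table. The function names (map_document, reduce_counts, llr) are hypothetical, the statistic is the textbook G^2 test rather than the thesis's Information Geometry formulation, and no Hadoop or HBase API is used.

```python
import math
from collections import Counter
from itertools import combinations


def map_document(tokens):
    """Map step (illustrative): emit (key, 1) once per distinct word
    and once per distinct unordered word pair in a single document."""
    vocab = sorted(set(tokens))
    for word in vocab:
        yield (word,), 1
    for pair in combinations(vocab, 2):
        yield pair, 1


def reduce_counts(emitted):
    """Reduce step (illustrative): sum the emitted values per key."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals


def llr(n_xy, n_x, n_y, n_docs):
    """G^2 log-likelihood ratio statistic for independence of words x and y.

    n_xy: documents containing both words; n_x, n_y: documents containing
    each word; n_docs: total number of documents.
    """
    # Cells: (x, y), (x, not y), (not x, y), (not x, not y).
    observed = [n_xy, n_x - n_xy, n_y - n_xy, n_docs - n_x - n_y + n_xy]
    row_totals = [n_x, n_x, n_docs - n_x, n_docs - n_x]
    col_totals = [n_y, n_docs - n_y, n_y, n_docs - n_y]
    g2 = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        if obs > 0:  # an empty cell contributes nothing to the sum
            expected = row * col / n_docs
            g2 += obs * math.log(obs / expected)
    return 2.0 * g2


if __name__ == "__main__":
    docs = [["big", "data"], ["big", "data", "cloud"], ["cloud"], ["mining"]]
    counts = reduce_counts(kv for doc in docs for kv in map_document(doc))
    score = llr(counts[("big", "data")], counts[("big",)],
                counts[("data",)], len(docs))
    # Under independence G^2 is approximately chi-square with 1 degree of
    # freedom, so a score above 3.84 rejects independence at the 5% level.
    print(("big", "data"), round(score, 3))
```

In a real cluster the map and reduce functions would run as distributed tasks and the summed counts would be read from a store such as HBase, as the abstract describes; here they are plain Python generators and a Counter so the sketch stays self-contained.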

Keywords/Search Tags:

big data, pure dependence, distributed, word relationship, DPDM algorithm

Related items

1	The Pure Relationship And Self-identity Through The Self-presentation In The Network
2	Reinforcing The Topic Of Embeddings With Theta Pure Dependence For Text Classification
3	Research On The Mobile Social Media Dependence And Satisfaction Of Interpersonal Relationship Among College Students
4	A Study On Parallel Dependence Relations Decomposition Of Programmable Logic Controller
5	Dependence analysis for distributed event-based systems
6	Research And Realization Of The Network Fault Management System Based On Dependence Relationship
7	Research Of The Alignment Between Features Of Space Relationships In 2D Images And Describing Words
8	Research On The Influence Of Relationship Intensity On Internet Word-of-Mouth Propagation Effect
9	A Data Placement Algorithm Based On Data Dependence
10	Research On API Recommendation Technology Based On Class Inheritance Relationship Analysis