Research And Implementation On Parallel PP Model With PSO For Text Classification

Posted on: 2013-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Huang
GTID: 2298330377459819
Subject: Computer Science and Technology
Abstract/Summary:
Classification is an important topic in data mining, machine learning, and other fields, and it is widely used in industry, business, and scientific research. As classification applications deepen and data volumes grow explosively, the analysis of high-dimensional and large-scale data is becoming more common and more important, and the datasets to be processed are often both high-dimensional and large-scale. When processing high-dimensional data, the amount of computation grows with the dimension and the data size, execution efficiency falls, and problems such as the "curse of dimensionality" often arise. In practical applications, reducing the dimension of high-dimensional data and parallelizing the corresponding algorithm model are effective ways to improve processing performance for high-dimensional and large-scale data.

Projection Pursuit (PP) is a new statistical method for processing and analyzing high-dimensional observation data, particularly non-normal, non-linear high-dimensional data. Because it does not assume that the data are normally distributed, it better preserves the original features of the data, and it is widely used in high-dimensional data analysis. The projection pursuit model involves optimizing a projection index. Many researchers have proposed methods for this optimization problem, such as Particle Swarm Optimization (PSO), the Genetic Algorithm (GA), and the Ant Colony Algorithm.

MapReduce, proposed by Google, is a parallel computing model aimed mainly at large-scale dataset processing. MapReduce takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. The MapReduce programming model makes it easier to write parallel programs that handle large-scale, high-dimensional data.

This paper applies the Projection Pursuit model to dimension reduction and uses PSO to find the optimal projection direction for texts in support of text classification, and it implements the model on MapReduce. In the classification stage, a MapReduce-based KNN classifier is designed and used, and the classification experiment is performed on the Fudan dataset. The results show that parallel particle swarm optimization for projection pursuit based on MapReduce is both effective and more efficient than its serial counterpart.
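
The search for a projection direction with PSO can be illustrated with a minimal serial sketch in Python. It is not the thesis implementation: the abstract does not state which projection index is used, so the variance of the projected data serves here as a stand-in, and the function names (`projection_index`, `pso_projection`) and parameter values (`w`, `c1`, `c2`, swarm size) are assumptions chosen for illustration. Each particle's fitness evaluation is independent of the others, which is the kind of step a map phase could distribute.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def projection_index(data, direction):
    """Stand-in projection index (an assumption): variance of the 1-D projection."""
    z = [sum(x * a for x, a in zip(row, direction)) for row in data]
    mean = sum(z) / len(z)
    return sum((v - mean) ** 2 for v in z) / len(z)


def pso_projection(data, n_particles=20, n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a unit direction that maximizes projection_index using basic PSO."""
    rng = random.Random(seed)
    dim = len(data[0])

    # Each particle is a candidate projection direction on the unit sphere.
    pos = [normalize([rng.uniform(-1, 1) for _ in range(dim)])
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Personal bests and the global best seen so far.
    pbest = [p[:] for p in pos]
    pbest_fit = [projection_index(data, p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            # Standard velocity update: inertia + cognitive pull + social pull.
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
            # Move, then re-normalize so the particle stays a unit direction.
            pos[i] = normalize([p + v for p, v in zip(pos[i], vel[i])])

            fit = projection_index(data, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit

    return gbest, gbest_fit
```

Calling `pso_projection(data)` on a list of equal-length numeric rows returns the best unit direction found and its index value. Projecting each row onto that direction gives a one-dimensional representation that a classifier such as KNN could then consume.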
Keywords/Search Tags: Projection Pursuit, Particle Swarm Optimization, MapReduce, Text Classification, KNN, Parallel