Research Of Text Feature Selection Algorithm Based On Hadoop

Posted on:2016-05-12

Degree:Master

Type:Thesis

Country:China

Candidate:J Xu

Full Text:PDF

GTID:2308330461467285

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the information industry, data is growing exponentially. The volume is enormous, yet the data is disorganized, and when huge amounts of data cannot be classified well, potentially useful information stays hidden. As a result, we often find ourselves short of knowledge despite an abundance of information. Text categorization, regarded as a foundation of data mining and information retrieval, can organize such information sensibly. Traditional serial text classification algorithms are efficient on small data sets but cannot cope with large ones, while traditional parallel computing frameworks are complicated and require the programmer to understand low-level details. In recent years, the open source distributed platform Hadoop has developed rapidly. It provides a parallel programming framework, MapReduce, and a distributed file system, which make the storage and processing of massive data increasingly efficient. Hadoop thus offers a new solution for text categorization over huge amounts of data, and research on parallel text classification has real practical significance.

The methods used at each stage of text classification have been found to have a decisive influence on classification performance in a serial environment. The feature selection stage is especially important: it scores each feature with an evaluation function and keeps the features with the highest scores. After analyzing the evaluation functions in common use and considering both inter-class and intra-class factors, we propose a novel feature selection algorithm called CCD (Category Correlation Degree). To test its performance, we run experiments on two data sets of different sizes.
Compared with traditional feature selection algorithms on these data sets, the experimental results show that the proposed algorithm achieves better classification performance on both the small and the large data set. Although the CCD method has certain advantages in classification performance, it still does not solve the time and space consumption problems posed by large data sets.

Text classification comprises word segmentation, feature selection, and feature weighting, all of which involve a large amount of computation, so the time and space complexity are very high. To address these problems, this paper uses Hadoop's strengths in massive data storage and processing to parallelize text classification. Finally, we test the same data set in the parallel environment. The results show that the running time in the parallel environment is far less than in the serial environment, while the classification accuracy differs very little.
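
The pipeline described above (count statistics in a map/reduce style, score each feature with an evaluation function, keep the top-scoring features) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the CCD formula, so the classic chi-square statistic is used here as a stand-in evaluation function, and the map and reduce steps are simulated in memory with plain Python rather than with the Hadoop API. All function names are hypothetical.

```python
from collections import Counter


def map_doc(label, text):
    """Map step: emit ((term, label), 1) once per distinct term in a document."""
    for term in set(text.lower().split()):
        yield (term, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the values for each (term, label) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def chi_square(term, label, df, docs_per_class):
    """Stand-in evaluation function (chi-square), NOT the thesis's CCD."""
    n = sum(docs_per_class.values())
    a = df.get((term, label), 0)          # docs in class that contain term
    b = sum(count for (t, l), count in df.items()
            if t == term and l != label)  # docs outside class that contain term
    c = docs_per_class[label] - a         # docs in class without term
    d = n - a - b - c                     # docs outside class without term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, k):
    """Return the k terms with the highest score over all classes."""
    pairs = (pair for label, text in docs for pair in map_doc(label, text))
    df = reduce_counts(pairs)
    docs_per_class = Counter(label for label, _ in docs)
    vocab = {term for term, _ in df}
    scores = {
        term: max(chi_square(term, label, df, docs_per_class)
                  for label in docs_per_class)
        for term in vocab
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

Calling `select_features(docs, k)` on a list of `(label, text)` pairs returns the `k` highest-scoring terms. In an actual Hadoop job, `map_doc` and `reduce_counts` would correspond to a mapper and a reducer running across the cluster, and the scoring function would be replaced by the evaluation function under study.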

Keywords/Search Tags:

Text Classification, Feature Selection, Category Correlation Degree, Hadoop, MapReduce

Related items

1	A Research Of Text Feature Selection Algorithm Based On Cloud Platform
2	The Research Of Big Data Text Classification Method Based On Mapreduce
3	Research On Feature Selection Method Based On Text Category Relevance Degree And Latent Semantic Analysis
4	Research Of Feature Selection Method For Chinese Text Classification
5	Research Of Hierarchical Text Classification Methods Based On Category Structure
6	Research And Implementation Of Automatic Text Classification Based On Hadoop
7	Research And Implementation Of Feature Selection In Chinese Text Classification
8	Research Of Feature Selection And Weighting Algorithm In Text Classification System Based On SVM
9	Research On Feature Selection And Feature Weighting Of Text Classification
10	The Research Of MapReduce Implementing Of Text Classification KNN Algorithm Based On Mass Data