Research Of Text Feature Selection Algorithm Based On Hadoop

Posted on: 2016-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: J Xu
GTID: 2308330461467285
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the information industry, data is growing exponentially. The volume is enormous, yet the data is disorganized, and when huge amounts of data cannot be classified well, potentially useful information stays hidden. As a result, we often feel short of knowledge despite being surrounded by information. Text categorization, regarded as a foundation of data mining and information retrieval, can organize such information in a reasonable way. Text classification algorithms implemented with traditional serial methods are efficient on small data sets but cannot cope with large ones. Traditional parallel computing frameworks, in turn, are complicated and require the programmer to understand the underlying details. In recent years, the open source distributed platform Hadoop has developed rapidly. It provides the MapReduce parallel programming framework and a distributed file system, which make storing and processing massive data more efficient. Hadoop therefore offers a new solution for text categorization over huge amounts of data, and research on parallel text classification has practical significance.

The methods used at each stage of text classification have been found to have a decisive influence on classification performance in a serial environment. The feature selection stage is especially important: it scores each feature with an evaluation function and then selects the features with the highest scores. After analyzing the commonly used evaluation functions and considering inter-class and intra-class factors, we propose a novel feature selection algorithm called CCD (Category Correlation Degree). To test its performance, we run experiments on two data sets of different sizes. Compared with traditional feature selection algorithms on these data sets, the experimental results show that the proposed algorithm achieves better classification performance on both the small and the large data set. Although the CCD method has advantages in classification performance, it still cannot solve the time and space consumption problems that arise with large data sets.

Text classification includes word segmentation, feature selection and feature weighting, all of which involve a large amount of computation, so the time and space complexity is very high. To address these problems, this paper uses Hadoop's strengths in mass data storage and processing to parallelize text classification. Finally, we test the same data set in the parallel environment. The results show that the running time in the parallel environment is far less than in the serial environment, while the classification accuracy differs very little.
Keywords/Search Tags: Text Classification, Feature Selection, Category Correlation Degree, Hadoop, MapReduce
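
The feature selection step described in the abstract (score every feature with an evaluation function, then keep the highest-scoring ones) can be sketched in a map/reduce style. The abstract does not define CCD, so this sketch substitutes the standard chi-square statistic as the evaluation function. The toy corpus and the names `map_doc`, `reduce_counts`, `chi_square` and `select_features` are hypothetical, and the map and reduce steps run in plain Python rather than on Hadoop.

```python
from collections import Counter
from functools import reduce

# Toy labelled corpus (hypothetical data, for illustration only).
DOCS = [
    ("sports", "match goal team win"),
    ("sports", "team coach goal"),
    ("tech", "cpu memory cluster"),
    ("tech", "cluster node memory team"),
]


def map_doc(label, text):
    """Map step: count each distinct term of one document once, keyed by (term, label)."""
    return Counter((term, label) for term in set(text.split()))


def reduce_counts(left, right):
    """Reduce step: sum partial document frequencies per (term, label) key."""
    return left + right


def chi_square(term, label, df, docs_per_label, n_docs):
    """Chi-square score of a term for one class (a stand-in for CCD, not CCD itself)."""
    a = df[(term, label)]                                     # in class, contains term
    b = sum(n for (t, _), n in df.items() if t == term) - a   # other classes, contains term
    c = docs_per_label[label] - a                             # in class, lacks term
    d = n_docs - a - b - c                                    # other classes, lacks term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, k):
    """Score every term by its best per-class chi-square value and keep the top k."""
    df = reduce(reduce_counts, (map_doc(label, text) for label, text in docs), Counter())
    docs_per_label = Counter(label for label, _ in docs)
    terms = {term for term, _ in df}
    scores = {
        term: max(chi_square(term, label, df, docs_per_label, len(docs))
                  for label in docs_per_label)
        for term in terms
    }
    # Sort by descending score; break ties alphabetically for a stable result.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


if __name__ == "__main__":
    print(select_features(DOCS, 3))
```

On this toy corpus the terms that occur in every document of exactly one class ("cluster", "goal", "memory") score highest, while "team", which appears in both classes, is ranked lower. On a real cluster the map and reduce functions would run as distributed tasks over document splits, and only the small table of counts would be needed for scoring.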