
The Research of Big Data Text Classification Method Based on MapReduce

Posted on: 2016-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Chen
GTID: 2308330470473762
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of mobile and Internet networks has brought convenience to people's lives, and a large amount of data is uploaded to the network every moment. According to statistics, the amount of data will grow nearly tenfold by 2020, and the number of texts on the web is growing at an alarming rate. Against this background of big data, extracting valuable information from massive text collections is crucial. For this reason, text classification, an important basis for text mining and information retrieval, has attracted widespread attention. On the one hand, the first step in classifying Chinese text is word segmentation, but the Chinese vocabulary contains hundreds of thousands of words, so a feature lexicon built from all segmented words of a large-scale text set has a very high dimension. Feature selection and extraction are therefore necessary. On the other hand, the traditional centralized system architecture cannot meet the requirements of big data analysis and processing. Google's MapReduce parallel programming model creates the conditions for text classification on large-scale datasets.
Hadoop is an open source distributed system that provides an implementation of the MapReduce programming model, with principles similar to those of Google's MapReduce design. This thesis uses Hadoop, a Java-based open source MapReduce parallel computing framework and system, and mainly does the following work:
(1) It studies the processes and technologies relevant to text classification and introduces its key parts in detail, such as feature selection and text classification algorithms. Because the Hadoop platform is becoming a powerful tool for big data processing, the thesis also studies its MapReduce parallel programming model and its distributed file system, HDFS, in more depth.
(2) Because preprocessing for Chinese text categorization requires word segmentation and stop word removal, two parallelization methods are compared, and the more efficient one is chosen to design a MapReduce-based parallel framework that integrates segmentation with stop word removal. A large-scale input text dataset always needs feature selection to extract the most representative feature items and reduce the dimension of the feature space. The thesis analyzes the deficiencies of the traditional mutual information feature selection algorithm, reviews improvements that others have made to it, proposes a feature selection method, CDMT, based on the differences between categories, and designs a MapReduce-based parallel feature selection framework for CDMT.
(3) The MapReduce framework is applied to text classification: based on an analysis of the Naive Bayes classification algorithm, a MapReduce-based parallel Naive Bayes classification framework is designed. This framework is then tested in combination with the feature selection method in a Hadoop experimental environment, and the experiments show that feature items extracted with the CDMT feature selection algorithm achieve better classification performance.
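To make the feature selection step more concrete, the following is a minimal sketch of stop word removal followed by classic mutual information scoring, written in a map/reduce style. It is an illustration under stated assumptions, not the thesis's implementation: it runs in a single Python process rather than on Hadoop, it assumes the text is already segmented into whitespace-separated words, it uses the traditional pointwise mutual information score rather than CDMT (whose formula is not given in this abstract), and the names `STOP_WORDS`, `map_doc`, `reduce_counts`, `mutual_information`, and `select_features` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical, tiny stop word list; a real system would load a full one.
STOP_WORDS = {"the", "a", "of", "is", "and"}


def map_doc(label, text):
    """Map step: emit ((label, term), 1) once per distinct
    non-stop-word term in one pre-segmented document."""
    terms = {t for t in text.lower().split() if t not in STOP_WORDS}
    for term in sorted(terms):
        yield (label, term), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def mutual_information(df, docs_per_class, n_docs):
    """Score each term by its maximum pointwise mutual information
    over classes: log(P(t, c) / (P(t) * P(c))), estimated from
    document frequencies."""
    term_df = defaultdict(int)
    for (_, term), n in df.items():
        term_df[term] += n
    scores = {}
    for (label, term), n in df.items():
        mi = math.log((n * n_docs) / (term_df[term] * docs_per_class[label]))
        scores[term] = max(scores.get(term, float("-inf")), mi)
    return scores


def select_features(corpus, k):
    """Run map, reduce, and scoring, then keep the k best terms.
    Ties are broken alphabetically so the result is deterministic."""
    pairs = [kv for label, text in corpus for kv in map_doc(label, text)]
    df = reduce_counts(pairs)
    docs_per_class = reduce_counts((label, 1) for label, _ in corpus)
    scores = mutual_information(df, docs_per_class, len(corpus))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

In this sketch, a term that appears equally often in every class scores zero and is ranked last, while a term confined to one class scores higher. Classic mutual information is also known to favor low-frequency terms, which is one commonly cited weakness of the traditional algorithm.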
Keywords/Search Tags: MapReduce, Big Data, Feature Selection, Hadoop, Text Classification