The Research Of Big Data Text Classification Method Based On MapReduce

Posted on:2016-10-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y C Chen

Full Text:PDF

GTID:2308330470473762

Subject:Computer Science and Technology

Abstract/Summary:

The rapid development of mobile and Internet networks has brought convenience to people’s lives, and large amounts of data are uploaded to the network every moment. According to statistics, the amount of data will grow nearly tenfold by 2020, and the number of texts on the web is growing at an alarming rate. Against this background of big data, extracting valuable information from massive text collections is crucial. For this reason, text classification, an important foundation for text mining and information retrieval, has attracted widespread attention. On the one hand, the first step in classifying Chinese text is word segmentation, but the Chinese vocabulary contains hundreds of thousands of words, so a feature thesaurus built from all the segmented words of a large-scale text set has a huge dimension. Feature selection and extraction are therefore necessary. On the other hand, the traditional centralized system framework does not meet the requirements of big data analysis and processing. Google’s MapReduce parallel programming model creates the conditions for text classification on large-scale datasets.
Hadoop, a distributed system, offers an open-source implementation of the MapReduce programming model whose principles follow Google’s MapReduce design. This thesis uses Hadoop, an open-source, Java-based MapReduce parallel computing framework and system, and mainly does the following:
(1) It studies the relevant processes and technologies of text classification, introducing in detail its important parts, such as feature selection and text classification algorithms. Because the Hadoop platform is becoming a powerful tool for big data processing, the thesis also studies in more depth its MapReduce parallel programming model and its distributed file system, HDFS.
(2) Because preprocessing for Chinese text categorization requires word segmentation and stop word removal, two parallelization methods are compared and the more efficient one is chosen to design a MapReduce-based parallel framework that integrates segmentation with stop word removal. A large-scale text dataset always needs feature selection to extract the most representative feature items and reduce the dimension of the feature space. The thesis analyzes the deficiencies of the traditional mutual information feature selection algorithm, examines improvements to it proposed by others, proposes a feature selection method, CDMT, based on the difference between categories, and designs a MapReduce-based parallel framework for CDMT feature selection.
(3) The MapReduce framework is applied to text classification: based on an analysis of the Naive Bayes classification algorithm, a MapReduce-based parallel Naive Bayes classification framework is designed. It is then tested in combination with the feature selection method in a Hadoop experimental environment, and the experiments show that feature items extracted with the CDMT feature selection algorithm achieve better classification performance.
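
The abstract does not give the details of CDMT, so the following is only a minimal sketch of the traditional mutual information baseline it refers to, written as map and reduce steps that run in memory in plain Python rather than on Hadoop. The stop word list, the toy corpus, and every function name are hypothetical, and English tokens stand in for the output of Chinese word segmentation. The score used is the common pointwise form MI(t, c) = log(P(t, c) / (P(t) P(c))), taking the maximum over classes for each term.

```python
import math
from collections import defaultdict

# Hypothetical stop word list and toy corpus (assumptions, not from the thesis).
STOP_WORDS = {"the", "a", "of"}
DOCS = [
    ("sports", ["the", "team", "won", "the", "match"]),
    ("sports", ["team", "match", "score"]),
    ("finance", ["the", "market", "score", "fell"]),
    ("finance", ["market", "stock", "fell"]),
]

def map_doc(label, tokens):
    """Map step: drop stop words, emit ((label, term), 1) once per distinct term."""
    for term in sorted(set(tokens) - STOP_WORDS):
        yield (label, term), 1

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def mutual_information(class_term_df, docs_per_class):
    """Score each term by max over classes of log(P(t, c) / (P(t) * P(c)))."""
    n_docs = sum(docs_per_class.values())
    term_df = defaultdict(int)  # document frequency of each term over all classes
    for (_, term), count in class_term_df.items():
        term_df[term] += count
    scores = {}
    for (label, term), count in class_term_df.items():
        mi = math.log(count * n_docs / (term_df[term] * docs_per_class[label]))
        scores[term] = max(scores.get(term, float("-inf")), mi)
    return scores

# Simulated job: map every document, then reduce the emitted pairs by key.
pairs = [kv for label, tokens in DOCS for kv in map_doc(label, tokens)]
class_term_df = reduce_counts(pairs)
docs_per_class = reduce_counts((label, 1) for label, _ in DOCS)
scores = mutual_information(class_term_df, docs_per_class)

# Highest-scoring terms first; ties are broken alphabetically.
top_terms = sorted(scores, key=lambda t: (-scores[t], t))
```

In this toy corpus the term "score" appears equally in both classes and ranks last, while a term seen in a single document ("won") ties with one seen in every document of its class ("team"). That bias toward rare terms is a commonly cited limitation of this score; whether it is the deficiency the thesis analyzes is not stated in the abstract.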

Keywords/Search Tags:

MapReduce, Big Data, Feature Selection, Hadoop, Text Classification

Related items

1	Research Of Text Feature Selection Algorithm Based On Hadoop
2	A Research Of Text Feature Selection Algorithm Based On Cloud Platform
3	The Research Of MapReduce Implementing Of Text Classification KNN Algorithm Based On Mass Data
4	Research And Implementation Of Automatic Text Classification Based On Hadoop
5	Research On Text Classification Method Based On Hadoop
6	Application Research Of Text Classification Based On Hadoop Platform
7	The Research Of Mapreduce Implementing Of Text Classification Algorithm Based On Mass Data
8	Design And Implementation Of Text Classification System Based On Hadoop Platform
9	Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification
10	Researches On Feature Selection In Text Classification