Research On Text Classification Method Based On Hadoop

Posted on:2020-01-30

Degree:Master

Type:Thesis

Country:China

Candidate:Z L Bai

Full Text:PDF

GTID:2428330590979402

Subject:Computer application technology

Abstract/Summary:

With the rapid development and adoption of Internet technology, the volume of network data is growing explosively, and big data applications pose new challenges for data analysis and text classification. Given big data application scenarios and data storage structures, research on basic analysis and classification methods for very large data sets is particularly important. Big data has value, often described as data wealth, only when the information we need can be extracted from it. Closer analysis shows that each stage of text classification affects the final result to a different degree, and that the quality of a classification algorithm often hinges on feature selection. A good feature selection method can also, to some extent, mitigate the high computational complexity and reduced classification accuracy caused by the high-dimensional, sparse features that are common in classification problems. Therefore, to keep pace with the big data era and realize the value of data, this thesis studies classification methods for big data from the following two aspects:

1. To address the problem of multidimensional information extraction in big data analysis and processing, a text classification method is proposed that mainly improves feature extraction in the classification process. Because the traditional chi-square statistic (CHI) yields too large a set of feature words, a T-CHI feature selection algorithm that incorporates synonyms is proposed. HowNet is used to calculate word similarity and merge synonyms, which reduces the dimension of the feature space and improves the accuracy of text classification.

2. For text classification on big data, the proposed text classification algorithm is combined with the Hadoop framework to achieve fast data processing. Hadoop is a distributed processing system that combines storage and computation: it implements a distributed file system (HDFS) for storing data and a distributed computing framework (MapReduce) for parallel computation. Combining the Hadoop platform with text classification is intended to significantly reduce the time cost and memory consumption of the classification work.

In summary, this thesis improves the feature selection algorithm used in text categorization and combines the improved method with the Hadoop platform, yielding a Hadoop-based text categorization method that uses Hadoop's distributed storage and parallel computation to improve the efficiency of text categorization. Experiments show that the method reduces execution time when processing large amounts of data.
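
The idea of merging synonyms before chi-square scoring can be illustrated with a minimal Python sketch. This is not the thesis's T-CHI algorithm: the abstract gives no formula, so the code uses the standard chi-square statistic over a 2x2 term/class contingency table, and a hand-written `synonym_map` dictionary (hypothetical) stands in for the HowNet similarity computation. The function names and the toy corpus are also assumptions made for illustration.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term/class 2x2 table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def merge_synonyms(tokens, synonym_map):
    """Replace each token by its canonical form.

    synonym_map is a hypothetical stand-in for a HowNet-based
    similarity step; tokens without an entry are left unchanged.
    """
    return [synonym_map.get(t, t) for t in tokens]


def chi_scores(docs, labels, target, synonym_map=None):
    """Score every term against one target class.

    docs: list of token lists; labels: one class label per document.
    Synonyms are merged first, so they share a single feature.
    """
    synonym_map = synonym_map or {}
    doc_terms = [set(merge_synonyms(d, synonym_map)) for d in docs]
    vocab = set().union(*doc_terms)
    n_pos = sum(1 for label in labels if label == target)
    n_neg = len(labels) - n_pos
    scores = {}
    for term in vocab:
        a = sum(1 for terms, label in zip(doc_terms, labels)
                if term in terms and label == target)
        b = sum(1 for terms, label in zip(doc_terms, labels)
                if term in terms and label != target)
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return scores


if __name__ == "__main__":
    # Toy corpus (invented for illustration only).
    docs = [["car", "fast"], ["automobile", "fast"],
            ["apple", "sweet"], ["apple", "fruit"]]
    labels = ["auto", "auto", "food", "food"]
    plain = chi_scores(docs, labels, "auto")
    merged = chi_scores(docs, labels, "auto", {"automobile": "car"})
    print(len(plain), len(merged))
    print(plain["car"], merged["car"])
```

On this toy corpus, merging "automobile" into "car" shrinks the vocabulary from six terms to five and raises the score of "car", because the merged feature now covers every document of the target class. In a MapReduce setting the per-term document counts would typically be produced by mappers and summed by reducers, but that part is not sketched here.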

Keywords/Search Tags:

Feature selection, T-CHI, Hadoop, Text categorization

Related items

1	The Research Of Text Representation And Feature Selection In Text Categorization
2	Theoretical Analysis And Algorithm Study On Feature Selection For Text Categorization
3	A Study On Text Categorization Based On Machine Learning
4	Normal Weight Based Feature Selection Method In SVM Text Categorization
5	Related Technologies Research On Feature Selection For Text Categorization
6	Feature Selection Methods For Text Categorization
7	Research And Implement Of Chinese Multi-Selection Text Categorization System Based On Hadoop
8	χ² Statistics-based Chinese Text Categorization Feature Selection Method
9	Research On Text Categorization Based On LDA And SVM
10	Research On Feature Selection And Classification Methods For Text Categorization