Research On Text Classification Method Based On Hadoop

Posted on: 2020-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Bai
GTID: 2428330590979402
Subject: Computer application technology
Abstract/Summary:
With the rapid development and application of Internet technology, the amount of network information is exploding, and big data applications bring new challenges to data analysis and text classification. Given big data application scenarios and data storage structures, research on basic analysis and classification methods for very large amounts of data is particularly important. Big data has value, sometimes called data wealth, only when the desired information can be extracted from it.

In-depth analysis shows that every stage of text classification influences the final classification result to a different degree, and that the quality of a classification algorithm often depends most on feature selection. A good feature selection method can also, to some degree, mitigate the high computational complexity and reduced classification accuracy caused by the high-dimensional, sparse features that often appear in classification problems. Therefore, to keep pace with the big data era and realize the value of data, this thesis studies classification methods for big data from the following two aspects:

1. To address the problem of multidimensional information extraction in big data analysis and processing, a text classification method is proposed that mainly improves the feature extraction step. Because the feature space remains too large when the traditional chi-square statistic (CHI) is used to select feature words, a T-CHI feature selection algorithm that incorporates synonyms is proposed. It uses HowNet to calculate word similarity and merge synonyms, thereby reducing the dimension of the feature space and improving the accuracy of text classification.

2. To address text classification on big data, the proposed text classification algorithm is combined with the Hadoop framework to process data quickly. As a distributed processing system that combines storage and computation, Hadoop provides a distributed file system (HDFS) for storing data and a distributed computing framework (MapReduce) for parallel computation. Combining the Hadoop platform with text classification technology exploits these advantages and significantly reduces the time cost and memory consumption of classification.

In summary, this thesis improves the feature selection algorithm used in text categorization and combines the improved method with the Hadoop platform, yielding a Hadoop-based text categorization method that uses Hadoop's strengths to improve the efficiency of text categorization. Experiments show that the method can reduce execution time when processing large amounts of data.
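The abstract gives no formulas or code, so the following is only a minimal sketch of the general idea, not the thesis's T-CHI algorithm or its Hadoop implementation. It merges synonyms before counting, counts document frequencies in a map/reduce style, and scores each term with the standard chi-square statistic. The `SYNONYMS` table is a hypothetical stand-in for groupings that the thesis derives from HowNet word similarity. The map and reduce steps are plain Python functions run in one process, not Hadoop API calls, and all function names are illustrative.

```python
from collections import Counter

# Hypothetical synonym table (assumption): the thesis derives such
# groupings from HowNet word similarity, which is not reproduced here.
SYNONYMS = {"automobile": "car", "vehicle": "car"}


def canonical(term):
    """Replace a term by its synonym-group representative, if it has one."""
    return SYNONYMS.get(term, term)


def map_doc(label, tokens):
    """Map step: emit ((term, label), 1) once per distinct merged term."""
    for term in {canonical(t) for t in tokens}:
        yield (term, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (term, label) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/class table.

    a: docs in the class containing the term
    b: docs outside the class containing the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def chi_scores(docs):
    """Score each merged term by its maximum chi-square over all classes.

    docs is a list of (label, tokens) pairs.
    """
    n = len(docs)
    class_sizes = Counter(label for label, _ in docs)
    counts = reduce_counts(
        kv for label, tokens in docs for kv in map_doc(label, tokens)
    )
    term_totals = Counter()
    for (term, _), cnt in counts.items():
        term_totals[term] += cnt

    scores = {}
    for term, total in term_totals.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = counts[(term, label)]
            b = total - a
            c = size - a
            d = n - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    return scores


# Toy corpus (assumption): two classes with two documents each.
docs = [
    ("auto", ["car", "engine"]),
    ("auto", ["automobile", "wheel"]),
    ("sport", ["football", "goal"]),
    ("sport", ["football", "match"]),
]
scores = chi_scores(docs)
top_terms = sorted(scores, key=scores.get, reverse=True)
```

Because "automobile" is mapped to "car" before counting, the two words share one feature dimension and their document frequencies are pooled, which is the dimensionality-reduction effect the abstract attributes to merging synonyms. On a real cluster, `map_doc` and `reduce_counts` would correspond to a MapReduce mapper and reducer operating on data stored in HDFS.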
Keywords/Search Tags: Feature selection, T-CHI, Hadoop, Text categorization