A Research Of Text Feature Selection Algorithm Based On Cloud Platform

Posted on: 2017-10-09  Degree: Master  Type: Thesis
Country: China  Candidate: J F Wang  Full Text: PDF
GTID: 2348330488996089  Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of Internet information technology and its industry chain, industry data has been growing ever faster, particularly in the data centers that support Internet business and social communication services. Although the amount of data is huge, since even basic transactions can be tracked, its structure is so disorganized that large-scale data has not been categorized and its implied value cannot be extracted, which leaves people unable to make use of it. Much research shows that each step of text categorization affects the final result to a different degree, and that feature selection in particular is usually regarded as playing the core role. Feature selection can also effectively address the high computational complexity and low classification accuracy caused by high-dimensional sparse matrices. Because classic text feature selection algorithms do not comprehensively evaluate word frequency within a document, the degree of concentration among classes, and the dispersion within a class, this thesis proposes an improved combined feature selection algorithm called CHMI, based on the chi-square statistic (CHI) and mutual information (MI). CHMI is compared with classic text feature selection algorithms on an open Chinese corpus, and the results verify that it yields better classification results. Although CHMI has a certain advantage over the classic algorithms, on large-scale data sets it still cannot overcome its own algorithmic complexity, which leads to high time and space consumption. Finally, the MapReduce model is combined with CHMI to propose a cloud text feature selection algorithm based on Hadoop, called MRCHMI. Experiments show that, compared with a single-node environment, MRCHMI improves time efficiency without affecting classification results.
Keywords/Search Tags: Feature selection, text classification, MapReduce, CHMI, Hadoop, MRCHMI
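
The abstract does not state how CHMI combines its two base measures, so the sketch below makes no attempt to reproduce it. It only illustrates the two classic scores CHMI is said to build on, the chi-square statistic (CHI) and pointwise mutual information (MI), computed from per-class document frequencies. The counting is written as a map step and a reduce step to suggest how a MapReduce job might parallelize it, but this is plain single-process Python, not Hadoop code. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def map_doc(label, tokens):
    """Map step: emit ((term, label), 1) once per distinct term in a document."""
    for term in set(tokens):
        yield (term, label), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (term, label) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

def contingency(term, label, df, docs_per_class):
    """Return (a, b, c, d) document counts for a term/class pair:
    a: in class, has term      b: other classes, has term
    c: in class, lacks term    d: other classes, lacks term
    """
    n = sum(docs_per_class.values())
    a = df.get((term, label), 0)
    b = sum(v for (t, lbl), v in df.items() if t == term and lbl != label)
    c = docs_per_class[label] - a
    d = n - a - b - c
    return a, b, c, d

def chi_square(a, b, c, d):
    """Classic chi-square statistic of a 2x2 term/class table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """Pointwise mutual information log(P(t, c) / (P(t) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # convention chosen here for a term absent from the class
    return math.log(a * n / ((a + c) * (a + b)))

# Toy corpus (made up for illustration): (class label, tokens) per document.
corpus = [
    ("sports", ["ball", "team"]),
    ("sports", ["ball", "score"]),
    ("finance", ["stock", "score"]),
    ("finance", ["stock", "market"]),
]

pairs = [kv for label, tokens in corpus for kv in map_doc(label, tokens)]
df = reduce_counts(pairs)
docs_per_class = Counter(label for label, _ in corpus)

for term in ("ball", "score"):
    table = contingency(term, "sports", df, docs_per_class)
    print(term, chi_square(*table), mutual_information(*table))
```

In this toy corpus "ball" appears only in the sports class while "score" is spread evenly across both classes, so both scores rank "ball" above "score" for that class. On a cluster, the map and reduce functions would run over partitions of the corpus, and only the aggregated counts would be needed to compute the scores.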