As networks continue to expand and the volume of information continues to grow, text classification in a centralized environment can no longer meet existing needs, so large-scale data processing in a distributed environment has become a focus of the IT industry. Large-scale text classification is needed in advertising and in information retrieval, so text categorization over large-scale data in a cloud computing environment has become a research focus. This article studies text classification algorithms and their incremental variants on the Hadoop platform, based on a proposed inverted index tree structure.

The main research achievements, contributions, and innovations can be summarized in the following points:

1. This article proposes an inverted index tree structure and parallelizes it on the cloud platform, in order to speed up feature selection methods, to support text classification algorithms such as KNN and Bayes, and to distribute the data loosely according to text vector dimension.
2. Based on the inverted index tree structure and the text classification algorithms, this article proposes an inverted index tree construction algorithm for massive data together with its pruning strategy, and presents an incremental inverted index tree algorithm and its parallel design.
3. Based on the inverted index tree structure, this article designs a K-means incremental classification algorithm and proposes a parallelization of the classification algorithm on the Hadoop platform.
4. Based on the inverted index tree structure, this article proposes a naive Bayes classifier algorithm built on the inverted index tree for the Hadoop cloud computing platform, along with three improved variants: naive Bayes text classification weighted by TFIDF, by mutual information, and by expected cross-entropy, respectively. It also presents a local naive Bayes text classification algorithm based on the inverted index tree.
5. Through experimental analysis on a purpose-built Hadoop cluster, this article verifies the inverted index tree structure and evaluates the classification accuracy, recall, and classification performance of the improved text classification methods.
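
As a rough illustration of the TFIDF-weighted naive Bayes idea in point 4, the following is a minimal single-machine Python sketch. It is not the article's algorithm: a flat dictionary index stands in for the inverted index tree, whose layout this summary does not specify, and the function names, the smoothed IDF formula, and the add-one smoothing are all assumptions made for illustration. Hadoop parallelization is omitted.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}.

    `docs` is a list of (tokens, label) pairs. A flat dict is an
    assumption standing in for the inverted index tree.
    """
    index = defaultdict(dict)
    for doc_id, (tokens, _label) in enumerate(docs):
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def train_weighted_nb(docs):
    """Train a naive Bayes model whose term counts are TF-IDF weighted."""
    index = build_inverted_index(docs)
    n_docs = len(docs)
    labels = [label for _tokens, label in docs]
    class_counts = Counter(labels)

    # Accumulate TF-IDF mass per (class, term) by walking the postings lists.
    mass = defaultdict(lambda: defaultdict(float))
    for term, postings in index.items():
        # Smoothed IDF (an assumed formula, not taken from the article).
        idf = math.log((1 + n_docs) / (1 + len(postings))) + 1.0
        for doc_id, tf in postings.items():
            mass[labels[doc_id]][term] += tf * idf

    vocab = set(index)
    model = {}
    for label, count in class_counts.items():
        total = sum(mass[label].values())
        model[label] = {
            "prior": math.log(count / n_docs),
            # Add-one smoothing over the weighted mass.
            "loglik": {
                t: math.log((mass[label].get(t, 0.0) + 1.0) / (total + len(vocab)))
                for t in vocab
            },
        }
    return model


def classify(model, tokens):
    """Return the label with the highest log posterior; unseen terms are skipped."""
    def score(label):
        params = model[label]
        return params["prior"] + sum(
            params["loglik"][t] for t in tokens if t in params["loglik"]
        )
    return max(model, key=score)


if __name__ == "__main__":
    toy_docs = [
        (["cheap", "ads", "buy"], "ad"),
        (["buy", "cheap", "offer"], "ad"),
        (["search", "query", "index"], "ir"),
        (["index", "retrieval", "query"], "ir"),
    ]
    toy_model = train_weighted_nb(toy_docs)
    print(classify(toy_model, ["cheap", "buy"]))
```

Training walks the postings lists rather than the raw documents, which is one reason an index keyed by term could lend itself to partitioning across workers. The mutual information and expected cross-entropy variants would replace only the `idf` weight in this sketch.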