Parallel Implementation On Document Classification And Similarity Analysis

Posted on:2017-11-21

Degree:Master

Type:Thesis

Country:China

Candidate:W Huang

Full Text:PDF

GTID:2348330512962123

Subject:Software engineering

Abstract/Summary:

With the development of artificial intelligence, machine learning has attracted growing attention. Classification, an important part of machine learning, has been widely applied in many fields. This thesis focuses on text classification algorithms and applies them to a dataset of software bug reports submitted over the internet from around the world. In the era of big data, it is no longer possible to screen all the data available on the internet by hand, so automatic computational methods are needed to handle the volume, and improving the performance of existing algorithms becomes critical. Massive datasets also contain duplicate entries, as well as similar items that differ in exact wording but carry the same meaning. If duplicates can be identified and similar entries merged automatically, a great deal of human labor is saved. This thesis first proposes a novel feature dimension reduction method and then presents two improved distributed classification algorithms for big data. Its main contributions fall into three parts. First, it proposes a novel feature dimension reduction method. Software bug reports are known to be subjective, since their content depends heavily on the individual submitter. The thesis improves the TF-IDF algorithm by taking into account how word frequency is distributed within a category and across categories, and then applies a special processing step to improve the feature reduction result. Second, it extends the multinomial Naive Bayes classification algorithm to a distributed computing environment. By considering the effect of text entries both between classes and within a specific class, it proposes and implements an improved multinomial Naive Bayes algorithm in a cloud environment using the MapReduce framework. Both accuracy and running time are reported to improve substantially.
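
To make the second contribution more concrete, the following is a minimal single-process sketch of how multinomial Naive Bayes training can be phrased as a map step and a reduce step. It is plain multinomial Naive Bayes with Laplace smoothing, not the improved variant described in the abstract, and it does not use Hadoop. The function names `map_doc`, `reduce_counts`, `train`, and `classify`, the whitespace tokenizer, and the toy bug-report data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def map_doc(label, text):
    """Map step: emit one count per word occurrence and one per document."""
    pairs = [(("word", label, w), 1) for w in text.lower().split()]
    pairs.append((("doc", label), 1))
    return pairs


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return totals


def train(docs):
    """Build per-class document counts and word counts from (label, text) pairs."""
    pairs = [p for label, text in docs for p in map_doc(label, text)]
    doc_counts = {}
    word_counts = defaultdict(Counter)
    for key, n in reduce_counts(pairs).items():
        if key[0] == "doc":
            doc_counts[key[1]] = n
        else:
            word_counts[key[1]][key[2]] = n
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data standing in for labeled bug reports.
docs = [
    ("bug", "crash on startup null pointer"),
    ("bug", "crash when saving file"),
    ("feature", "add dark mode option"),
    ("feature", "add export option"),
]
model = train(docs)
```

In a real MapReduce job the map step would run in parallel over input splits and the framework would group keys before the reduce step; here both run in one process, so the sketch shows only the shape of the computation.
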
Third, it presents an improved text similarity computation method and implements it on both a Hadoop cluster and a Spark cluster. Building on a study of document similarity calculation, it improves the feature weight calculation used in Google's SimHash algorithm. By reducing the number of pairwise comparisons between hash values, it shortens the running time. Taken together, these changes improve both accuracy and running time.
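
As an illustration of the kind of computation involved, here is a small sketch of a basic SimHash fingerprint with a Hamming-distance comparison. It uses raw term frequency as the feature weight, not the improved weighting the abstract describes. The helper `band_keys` shows one common way to cut down pairwise comparisons, by bucketing fingerprints on fixed-width bands; whether the thesis uses this particular scheme is not stated, so treat it as an assumption. All function names, the MD5-based token hash, and the 64-bit width are illustrative choices.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width, assumed for this sketch


def token_hash(token):
    """64-bit hash of a token, taken from the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    """Basic SimHash: each token votes on every bit, weighted by its frequency."""
    acc = [0] * BITS
    for token, weight in Counter(text.lower().split()).items():
        h = token_hash(token)
        for i in range(BITS):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, vote in enumerate(acc):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def band_keys(fingerprint, bands=4):
    """Split a fingerprint into equal-width bands, returned as (index, value) keys.

    Two fingerprints that differ in fewer bits than there are bands must agree
    on at least one band, so only documents sharing a key need a full comparison.
    """
    width = BITS // bands
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(bands)]
```

Documents whose fingerprints lie within a small Hamming distance of each other are treated as near-duplicates; the distance threshold is a tuning parameter that this sketch does not fix.
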

Keywords/Search Tags:

Text Classification, Feature Dimension Reduction, Multinomial Naive Bayes, Similarity Calculation, Distributed Computing

Related items

1	Research On Text Classification Algorithm Based On Naive Bayes Method
2	Research Of Chinese Text Classification Based On Naive Bayesian Method And Application Of Microblogging Data Classification
3	Text Classification Algorithm Research Based On Naive Bayes
4	Text Categorization Based On Naive Bayes Method
5	Research On Text Classification In The Application Of Customer Complaint Prediction Of Operator
6	Research On Spam Text Classification Based On Improved Naive Bayes Algorithm
7	A Text Classifier About High Blood Pressure Based On Naive Bayes
8	Design And Implementation Of Text Classification System Based On K-neighborhood And Naive Bayesian
9	Text Classification Method Based On Unsupervised Clustering And Naive Bayesian Classifier
10	Research On Feature Dimension Reduction In Text Classification