
Parallel Implementation On Document Classification And Similarity Analysis

Posted on: 2017-11-21  Degree: Master  Type: Thesis
Country: China  Candidate: W Huang  Full Text: PDF
GTID: 2348330512962123  Subject: Software engineering
Abstract/Summary:
With the development of artificial intelligence, machine learning has attracted more and more attention. Classification, an important part of machine learning, has been widely applied in many fields. This paper focuses on text classification algorithms and applies them to a dataset of software bug reports submitted over the internet from around the world. In the era of big data, it is no longer feasible to screen all the data available on the internet manually, so automatic computational methods are needed to handle the huge volume of data, and improving the performance of existing algorithms becomes critical. Massive datasets contain repeated entries as well as similar items that differ in their exact wording but carry the same meaning. If repeated data can be identified and similar entries merged automatically, a great deal of human labor is saved. This paper first proposes a novel feature dimension reduction method and then presents two improved distributed algorithms for big data. The main contributions of this paper fall into the following three aspects.
Firstly, it proposes a novel feature dimension reduction method. It is well known that software bug report data are subjective and depend heavily on the individual submitters. The paper improves the TF-IDF algorithm by taking into account the distribution of word frequency within the same category and among different categories. A special processing step is then applied to further improve the feature reduction result.
Secondly, it extends the polynomial Naive Bayes classification algorithm to a distributed computing environment. By considering the influence of terms between classes and within a single class, it proposes an improved polynomial Naive Bayes algorithm and implements it in a cloud environment using the MapReduce framework. Both accuracy and running time are greatly improved.
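
To make the second contribution more concrete, the sketch below shows how Naive Bayes training can be phrased as a map step and a reduce step. It is a minimal single-process illustration, not the thesis's improved algorithm: it assumes that "polynomial Naive Bayes" refers to the standard multinomial variant, it uses plain Laplace smoothing rather than the between-class and within-class weighting described above, and the function names (`map_phase`, `reduce_phase`, `train`, `predict`) and the sample bug reports are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_phase(doc):
    """Mapper: emit (key, 1) pairs for one labelled document."""
    label, text = doc
    yield ("doc", label), 1                 # one record per document, for class priors
    for word in text.lower().split():
        yield ("word", label, word), 1      # one record per token, for likelihoods


def reduce_phase(pairs):
    """Reducer: sum the values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Run map and reduce in-process and arrange the counts as a model."""
    totals = reduce_phase(pair for doc in docs for pair in map_phase(doc))
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for key, n in totals.items():
        if key[0] == "doc":
            doc_counts[key[1]] = n
        else:
            word_counts[key[1]][key[2]] = n
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for word in text.lower().split():
            if word in vocab:               # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical bug reports, used only to exercise the sketch.
reports = [
    ("crash", "app crashes on startup"),
    ("crash", "crash when opening file"),
    ("ui", "button misaligned on screen"),
    ("ui", "wrong colour on button"),
]
model = train(reports)
print(predict(model, "crash on startup"))
```

On a real cluster the mapper and reducer would run as separate distributed tasks over partitions of the report data; here they are chained in one process only to show the shape of the computation.
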
Thirdly, it presents an improved text similarity computation method and implements it on both a Hadoop cluster and a Spark cluster. Building on a study of document similarity calculation, it improves the feature weight calculation used with Google's SimHash algorithm. By reducing the number of pairwise comparisons between hash values, it shortens the running time. Taken together, these changes improve both accuracy and running time.
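
The SimHash approach in the third contribution can be illustrated with a short sketch. The abstract does not say how the feature weights are computed or how the number of pairwise comparisons is reduced, so the code below makes two labelled assumptions: raw term frequency stands in for the improved feature weight, and fingerprints are split into bands so that only documents sharing at least one identical band are compared. All function names and the sample texts are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict
from itertools import combinations

BITS = 64


def token_hash(token):
    """Stable 64-bit hash of a token."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text):
    """64-bit SimHash fingerprint; weight = term frequency (an assumption)."""
    weights = Counter(text.lower().split())
    acc = [0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i in range(BITS):
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def candidate_pairs(fingerprints, bands=4):
    """Pair up only documents that share at least one identical band."""
    width = BITS // bands
    mask = (1 << width) - 1
    buckets = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        for b in range(bands):
            buckets[(b, (fp >> (b * width)) & mask)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def near_duplicates(docs, max_distance=3, bands=4):
    """Return sorted id pairs whose fingerprints differ in <= max_distance bits."""
    fps = {doc_id: simhash(text) for doc_id, text in docs.items()}
    return sorted(
        (a, b)
        for a, b in candidate_pairs(fps, bands)
        if hamming(fps[a], fps[b]) <= max_distance
    )


# Hypothetical bug reports, used only to exercise the sketch.
docs = {
    "r1": "app crashes on startup",
    "r2": "app crashes on startup",
    "r3": "settings dialog layout glitch after resizing the window",
}
print(near_duplicates(docs))
```

With `bands = max_distance + 1`, two fingerprints that differ in at most `max_distance` bits must agree on at least one band, so the banding step skips most comparisons without missing any pair inside the threshold.
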
Keywords/Search Tags: Text Classification, Feature Dimension Reduction, Polynomial Naive Bayes, Similarity Calculation, Distributed Computing