
Research And Implementation On Feature Extraction And Classification Of Chinese Text Based On SPARK

Posted on: 2018-07-09; Degree: Master; Type: Thesis
Country: China; Candidate: C D Xie; Full Text: PDF
GTID: 2348330512982988; Subject: Computer software and theory
Abstract/Summary:
With the explosive growth of digital Chinese text, mining the value of this data quickly and effectively has become a pressing challenge. Chinese text classification is one of the key technologies in Chinese text processing and analysis, and it helps bring order to otherwise disorganized information. When the data set is large, a stand-alone classifier becomes a bottleneck in both storage and computing speed. Distributed storage and distributed computing technology can solve this problem. HDFS, one of the core modules of Hadoop, is a distributed file system that meets the need for distributed storage well. Spark, whose most distinctive feature is in-memory computation, is the successor to MapReduce and is faster than MapReduce. The main work of this thesis on Chinese text feature extraction and classification is as follows:

(1) A new text feature extraction method is proposed and compared with the traditional feature extraction method. The method takes into account the frequency distribution of words within a class and between classes, and uses the statistical variance to describe the importance of a word for text classification.

(2) A text feature representation method based on the document-category distribution vector and a text classification method based on the idea of an election are studied. In the classical text feature representation, each element of the document vector corresponds to a term in the document. In the method of this thesis, each element of the document vector corresponds to the estimated probability that the document belongs to a category. To obtain this probability estimate, the thesis explores two approaches, one based on Naive Bayes and the other based on the idea of an election. Unlike the Naive Bayes method, which relies on an independence assumption, the election-based method simply treats the words as voters that vote together to decide which category the document belongs to. For each word in the training dataset, statistical analysis yields estimates of the probability that the word belongs to each category, and each estimate can be regarded as a vote for that category. Two election strategies are proposed, based on BIM and MM, and the case where words carry different voting weights is also considered. Finally, the methods above are used to improve LDA-based text classification. Traditionally, LDA obtains the topic distribution vector of the test dataset through Gibbs sampling, which is slow. This thesis instead derives the topic distribution vector of the test dataset from the election idea and recomputes the topic distribution vector of the training dataset by the same method. The test dataset is then classified by the classifier, and both speed and classification performance improve.

(3) On the HDFS and Spark platforms, the MLlib algorithm component is used to implement two methods, one of which classifies the news corpus best and the other of which classifies the microblog corpus best.
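
The election idea in (2) can be illustrated with a minimal single-machine sketch in plain Python. It is not the thesis's implementation: the function names `train_word_votes` and `classify_by_vote` are hypothetical, the per-word probability is assumed to be the fraction of training documents containing the word that fall in each category, and the toy tokens stand in for Chinese text that has already been word-segmented. The BIM and MM strategies and the Spark/MLlib deployment are not reproduced here.

```python
from collections import Counter, defaultdict


def train_word_votes(docs, labels):
    """Estimate, for every word, a distribution over categories.

    Assumption (not from the thesis): P(category | word) is the share of
    training documents containing the word that belong to the category.
    """
    doc_counts = defaultdict(Counter)  # word -> category -> document count
    for tokens, label in zip(docs, labels):
        for word in set(tokens):  # count each word once per document
            doc_counts[word][label] += 1
    votes = {}
    for word, counts in doc_counts.items():
        total = sum(counts.values())
        votes[word] = {cat: n / total for cat, n in counts.items()}
    return votes


def classify_by_vote(tokens, votes, categories, weights=None):
    """Let the known words of a document vote; return (winner, tally).

    `weights` optionally maps a word to its voting weight (default 1.0).
    The normalized tally doubles as a document vector with one element
    per category.
    """
    tally = {cat: 0.0 for cat in categories}
    for word in set(tokens):
        if word not in votes:
            continue  # words unseen in training cast no vote
        w = 1.0 if weights is None else weights.get(word, 1.0)
        for cat, p in votes[word].items():
            tally[cat] += w * p
    total = sum(tally.values())
    if total > 0:
        tally = {cat: v / total for cat, v in tally.items()}
    # With no votes at all, max() falls back to the first category listed.
    winner = max(categories, key=lambda c: tally[c])
    return winner, tally


if __name__ == "__main__":
    train_docs = [
        ["team", "match", "win"],
        ["match", "goal"],
        ["stock", "market"],
        ["market", "price", "win"],
    ]
    train_labels = ["sports", "sports", "finance", "finance"]
    word_votes = train_word_votes(train_docs, train_labels)
    print(classify_by_vote(["team", "match", "price"], word_votes,
                           ["sports", "finance"]))
```

Because each word votes independently and the tallies are simple sums, the per-word statistics could in principle be computed with a distributed aggregation, which is one plausible reason the approach suits a platform such as Spark.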
Keywords/Search Tags: Text classification, Feature extraction, LDA, Naive Bayes, Spark