Research And Implementation Of Spark-based Text Classification

Posted on: 2018-06-25 | Degree: Master | Type: Thesis
Country: China | Candidate: S B Zhang | Full Text: PDF
GTID: 2428330569485420 | Subject: Computer technology
Abstract/Summary:
Text classification is a hot topic in natural language processing and data mining. As research on text classification has continued, the field has been subdivided into text representation, text preprocessing, feature dimensionality reduction, and classification techniques, which have gradually become its core research directions. First, based on a detailed understanding of each step of text classification and of the principles of unsupervised topic models, this thesis designs and implements a prototype text classification system built on Spark and the unsupervised topic model LDA (Latent Dirichlet Allocation). The prototype represents texts with the vector space model, applies TF-IDF (term frequency-inverse document frequency) feature selection as a first reduction of the text vectors, then uses LDA to reduce their dimensionality a second time and generate the topic model, and finally uses a support vector machine classification algorithm to classify the texts. The whole process requires no manual preprocessing of the corpus. Second, in terms of system design, the functional modules are split along service architecture lines so that some modules become autonomous, which reduces coupling between modules and improves the stability and reusability of the system modules. Finally, performance and classification experiments on a small-scale cluster show that the average response time of the prediction function does not exceed 3 s and that the average accuracy on a balanced long-text corpus exceeds 90%, which verifies the feasibility of the system.
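The first reduction stage described in the abstract, TF-IDF feature selection over vector-space text representations, can be illustrated with a minimal sketch. This is plain single-machine Python, not the thesis's Spark implementation, and the LDA and support vector machine stages are omitted. The function name `tfidf_select`, the whitespace tokenization, the smoothed IDF formula, and the rule of ranking terms by their highest TF-IDF weight in any document are all assumptions made for illustration; the abstract does not state which selection criterion the prototype uses.

```python
import math
from collections import Counter


def tfidf_select(docs, top_k):
    """Hypothetical helper: keep the top_k terms ranked by their highest
    TF-IDF weight in any document, and project each document onto them."""
    # Assumption: whitespace tokenization. A real corpus would need a
    # proper tokenizer or word segmenter.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumption: smoothed IDF, log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}

    # Per-document TF-IDF weights, with TF normalized by document length.
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})

    # Score each term by its best weight across documents.
    best = Counter()
    for doc_weights in weights:
        for term, value in doc_weights.items():
            best[term] = max(best[term], value)

    # Sort alphabetically first so that ties are broken deterministically.
    vocab = sorted(sorted(best), key=lambda t: -best[t])[:top_k]
    vectors = [[w.get(t, 0.0) for t in vocab] for w in weights]
    return vocab, vectors


if __name__ == "__main__":
    corpus = [
        "spark runs lda on text",
        "spark trains svm on text",
        "lda finds topics",
    ]
    vocab, vectors = tfidf_select(corpus, top_k=2)
    print(vocab)
    print(vectors)
```

In a pipeline like the one the abstract outlines, the reduced vectors would then be passed to LDA for the second reduction, and the resulting topic distributions would serve as input features for the support vector machine.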
Keywords/Search Tags:Text Classification, Feature Dimensionality Reduction, Topic Model, Support Vector Machine, Latent Dirichlet Allocation