Text Classification Algorithm Based On Deep Learning And Support Vector Machine

Posted on: 2022-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Shi
GTID: 2518306764968409
Subject: Automation Technology
Abstract/Summary:
In the era of big data, information is growing explosively. How to mine valuable information from massive Internet texts and manage and use those texts efficiently is a very challenging problem. Against this background, natural language processing (NLP) technology emerged. Text classification is one of the important research directions in NLP, and its main challenges include the high-dimensional sparsity of text representations, classifier efficiency, and sample noise. This thesis studies these problems. The research contents and main contributions are as follows:

(1) This thesis introduces the Word2Vec and GloVe word embedding theories from the deep learning field and combines them with traditional feature extraction methods to build word-embedding-based vectorized text representations. It explores text classification results under different text feature extraction methods, word embedding methods, and text vector dimensions. From the perspective of feature extraction, average word embedding weighting is better than TF-IDF-based word embedding weighting. Word embedding vectorization methods based on deep learning are generally better than traditional feature extraction methods: not only is the F1 score higher, but the average classifier training time is only 1.59% of that of the traditional methods. From the perspective of word embedding methods, GloVe is better than Word2Vec in most cases. From the perspective of text vector dimension, classification performance improves as the dimension of the text representation increases.

(2) A parallel data geometric analysis (PDGA) algorithm is proposed. The algorithm finely selects training samples for the Support Vector Machine (SVM), achieving sample reduction and thereby maximizing training speed while maintaining the classification performance of the SVM. The rationality and validity of the PDGA algorithm are verified on 2 text datasets, 4 UCI non-text datasets, and 1 artificial dataset. The results show that, among algorithms of the same kind, PDGA has the fastest execution speed and better retention of classification performance.

(3) A multivariate outlier recognition method based on the Mahalanobis distance combined with the idea of quantiles is proposed. The method is designed to alleviate the problem of sample noise. It runs during the execution of the PDGA algorithm and makes full use of the statistical information available from that algorithm. Experimental results show that the proposed method does not increase the time complexity or memory burden of the PDGA algorithm and can reasonably remove noise samples from the training set, which helps improve the classification performance of the SVM.
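As an illustration of the two weighting schemes compared in contribution (1), the following Python sketch builds a document vector first as a plain average of word embeddings and then as a TF-IDF-weighted average. The tiny `EMBEDDINGS` table, the toy corpus, and all function names are hypothetical stand-ins for trained Word2Vec or GloVe vectors, and the smoothed IDF formula is one common choice assumed here, not necessarily the one used in the thesis.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical 3-dimensional embeddings; real Word2Vec/GloVe vectors
# would be loaded from a trained model instead.
EMBEDDINGS = {
    "stock": np.array([0.9, 0.1, 0.0]),
    "market": np.array([0.8, 0.2, 0.1]),
    "football": np.array([0.0, 0.9, 0.2]),
    "match": np.array([0.1, 0.8, 0.3]),
    "the": np.array([0.3, 0.3, 0.3]),
}
DIM = 3


def average_embedding(tokens):
    """Unweighted mean of the embeddings of all in-vocabulary tokens."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return np.zeros(DIM)
    return np.mean(vectors, axis=0)


def inverse_document_frequency(corpus):
    """Smoothed IDF (assumed form): log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_weighted_embedding(tokens, idf):
    """Mean of embeddings, each token weighted by term frequency times IDF."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS and t in idf)
    if not counts:
        return np.zeros(DIM)
    weights = np.array([counts[t] / len(tokens) * idf[t] for t in counts])
    vectors = np.stack([EMBEDDINGS[t] for t in counts])
    return weights @ vectors / weights.sum()


# Toy tokenized corpus, used only to estimate document frequencies.
corpus = [
    ["the", "stock", "market"],
    ["the", "football", "match"],
    ["the", "market"],
]
idf = inverse_document_frequency(corpus)
doc = corpus[0]
avg_vec = average_embedding(doc)
weighted_vec = tfidf_weighted_embedding(doc, idf)
```

In this toy setting the TF-IDF variant down-weights "the", which appears in every document, so `weighted_vec` leans further toward the "stock" and "market" embeddings than `avg_vec` does. Either vector could then be passed to a classifier such as an SVM.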
Keywords/Search Tags: Text Classification, Word Embedding Vectorization, Deep Learning, Support Vector Machine, Sample Reduction
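The outlier screening idea in contribution (3) can be illustrated with a minimal Python sketch. It is not the thesis's method: it assumes the cutoff is simply an empirical quantile of the squared Mahalanobis distances, and the function name `mahalanobis_outlier_mask`, the `quantile` parameter, and the synthetic data are hypothetical. It also does not model how the step is embedded in PDGA.

```python
import numpy as np


def mahalanobis_outlier_mask(X, quantile=0.95):
    """Flag rows whose squared Mahalanobis distance exceeds a quantile cutoff.

    Assumption: the cutoff is the empirical `quantile` of the distances
    themselves; the thesis may define its threshold differently.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # The pseudo-inverse keeps the sketch usable if the covariance is singular.
    cov_inv = np.linalg.pinv(cov)
    centered = X - mean
    # d2[i] = centered[i]^T @ cov_inv @ centered[i]
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    threshold = np.quantile(d2, quantile)
    return d2 > threshold, d2


# Synthetic data: 200 Gaussian samples plus two injected noise samples.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
noise = np.array([[8.0, -8.0], [-9.0, 9.0]])
X = np.vstack([inliers, noise])

mask, d2 = mahalanobis_outlier_mask(X, quantile=0.99)
X_clean = X[~mask]  # reduced training set with flagged samples removed
```

A quantile cutoff always flags a fixed fraction of the samples, so in practice the quantile level would need tuning against the expected noise rate.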