Text Classification Algorithm Based On Deep Learning And Support Vector Machine

Posted on:2022-11-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y F Shi

Full Text:PDF

GTID:2518306764968409

Subject:Automation Technology

Abstract/Summary:

In the era of big data, information is growing explosively. Mining valuable information from massive volumes of internet text, and managing and using that text efficiently, is a very challenging problem. Natural language processing (NLP) technology emerged against this background. Text classification is one of the important research directions in NLP, and its main challenges include the high-dimensional sparsity of text representations, classifier efficiency, and sample noise. This thesis studies these problems. The research contents and main contributions are as follows:

(1) This thesis introduces the Word2Vec and GloVe word embedding theories from the deep learning field and combines them with traditional feature extraction methods to build word-embedding-based vector representations of text. Text classification results are explored under different text feature extraction methods, word embedding methods, and text vector dimensions. From the perspective of feature extraction, simple averaging of word embeddings performs better than TF-IDF-weighted averaging of word embeddings. The word embedding vectorization methods based on deep learning are generally better than traditional feature extraction methods: not only is the F₁ score higher, but the average classifier training time is only 1.59% of that of the traditional methods. From the perspective of word embedding methods, GloVe is better than Word2Vec in most cases. From the perspective of text vector dimension, classification performance improves as the dimension of the text representation increases.

(2) A parallel data geometric analysis (PDGA) algorithm is proposed. The algorithm selects training samples for the Support Vector Machine (SVM) in a fine-grained way, reducing the sample set so as to maximize training speed while maintaining the classification performance of the SVM. The rationality and validity of the PDGA algorithm are verified on 2 text datasets, 4 UCI non-text datasets, and 1 artificial dataset. The results show that, among comparable algorithms, PDGA has the fastest execution speed and retains classification performance better.

(3) A multivariate outlier recognition method based on the Mahalanobis distance combined with the idea of quantiles is proposed. This method is designed to alleviate the problem of sample noise. It is executed during the run of the PDGA algorithm and makes full use of the statistical information that the algorithm already computes. Experimental results show that the proposed method does not increase the time complexity or memory burden of the PDGA algorithm and can reasonably remove noise samples from the training set, which helps improve the classification performance of the SVM.
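As a rough illustration of the two document-vector schemes compared in contribution (1), the sketch below builds a plain average and a TF-IDF-weighted average of word embeddings. The tiny 3-dimensional embedding table, the smoothed IDF formula, and the function names are assumptions made for this example; the abstract does not give the thesis's exact formulas, and real Word2Vec or GloVe vectors would come from a pretrained model.

```python
import math
from collections import Counter

import numpy as np

# Toy 3-dimensional "embeddings" with made-up values (assumption);
# real Word2Vec or GloVe vectors would be loaded from a pretrained model.
EMBEDDINGS = {
    "svm": np.array([1.0, 0.0, 0.0]),
    "text": np.array([0.0, 1.0, 0.0]),
    "noise": np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def idf_table(corpus):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def mean_vector(doc):
    """Plain average of the embeddings of all in-vocabulary tokens."""
    vecs = [EMBEDDINGS[tok] for tok in doc if tok in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)


def tfidf_vector(doc, idf):
    """Average of token embeddings weighted by term frequency times IDF."""
    counts = Counter(tok for tok in doc if tok in EMBEDDINGS)
    if not counts:
        return np.zeros(DIM)
    weights = np.array([counts[tok] * idf.get(tok, 1.0) for tok in counts])
    vecs = np.stack([EMBEDDINGS[tok] for tok in counts])
    return weights @ vecs / weights.sum()
```

Either vector can then be passed to a classifier such as an SVM. In the weighted variant, tokens that are rare across the corpus pull the document vector toward their own embedding, whereas the plain average treats every token equally.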
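The outlier recognition idea in contribution (3) can be sketched as follows. This is only one plausible reading: the abstract does not say how the quantile is applied, so the per-class grouping, the cutoff `q`, and the function names here are assumptions. The sketch is also standalone, whereas the thesis runs the step inside PDGA and reuses its statistics.

```python
import numpy as np


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse keeps the computation defined if cov is singular.
    cov_inv = np.linalg.pinv(cov)
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


def drop_outliers(X, y, q=0.95):
    """Per class, keep samples whose distance is at or below the q-quantile.

    Assumes every class has at least two samples (hypothetical helper).
    """
    keep = np.zeros(len(X), dtype=bool)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        d2 = mahalanobis_sq(X[idx])
        keep[idx] = d2 <= np.quantile(d2, q)
    return X[keep], y[keep], keep
```

Calling `drop_outliers(X, y, q=0.95)` on a feature matrix `X` and label vector `y` returns the filtered data plus a boolean mask, so the same mask can be applied to any aligned arrays before SVM training.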

Keywords/Search Tags:

Text Classification, Word Embedding Vectorization, Deep Learning, Support Vector Machine, Sample Reduction

Related items

1	The Study Of Classification Methods And Its Applications In Web Mining Based On Statistical Learning
2	Research On Text Classification Algorithm Based On Support Vector Machine And Neural Network
3	LSTM Based Short Message Service (SMS) Modeling For Spam Classification
4	Research On Chinese Text Classification Based On Deep Learning
5	Research On Text Classification Method Based On Convolutional Neural Network
6	Research On Chinese Text Categorization Based On The Integrated Support Vector Machine Method
7	Combining Topic Model And Word Embedding For Short-Text Classification
8	Research On Community Content Classification Method Based On Machine Learning
9	Research On Text Classification System Based On Support Vector Machine
10	Research On Chinese Text Classification Based On Semantic Analysis