High-dimensional And Sparse Data Classification Based On Deep Learning

Posted on: 2020-09-03  Degree: Doctor  Type: Dissertation
Country: China  Candidate: M Y Jiang  Full Text: PDF
GTID: 1368330602955533  Subject: Computer Science and Technology
Abstract/Summary:
Internet-scale data contains a large amount of text, and managing and using it effectively is a research hotspot in information science. At the same time, with the continuous advancement of high-throughput experimental techniques, the volume of omics data has exploded, and disease characterization based on omics data is a hot topic in biomedical research. Text and metabolomics data, although of different origins, are both high-dimensional and sparse. Traditional machine learning methods often fail to achieve satisfactory results on high-dimensional sparse matrices because of the curse of dimensionality. This dissertation proposes high-dimensional sparse data classification methods based on deep learning, focusing on the application of deep learning to text and metabolomics data classification. The specific research work is as follows:

(1) For high-dimensional sparse text data, a text classification method combining Deep Belief Networks (DBN) and a softmax classifier is proposed. In this method, the DBN reduces the dimensionality of the high-dimensional, sparse text data, and the softmax classifier classifies the dimensionality-reduced data. During pre-training, the DBN and the softmax classifier are trained separately; in the fine-tuning phase, the two are treated as a single model, and the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) is introduced to adjust the model parameters. Experiments on the Reuters-21578 and 20-Newsgroup datasets show that the proposed method converges in the fine-tuning phase for text data of different scales, and that its text categorization results are significantly better than those of the K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms.

(2) For metabolomics data of breast hyperplasia, which are high-dimensional and sparse and have small sample sizes, this dissertation proposes a DBN and softmax classification model that incorporates the dropout strategy. During model training, DBN pre-training is first completed using unlabeled data, and L-BFGS is then used to fine-tune the whole model. To reduce over-fitting as much as possible, dropout is applied in both the pre-training and fine-tuning processes. Results from five-fold cross-validation and from datasets of different scales show that the proposed classification method outperforms KNN, SVM, and the Back Propagation Neural Network (BPNN), and that its classification results are stable.

(3) This dissertation presents a classification study of dilated cardiomyopathy metabolomics data based on a Stacked Auto Encoder (SAE) and SVM. Because these data have small sample sizes and are high-dimensional, nonlinear, and noisy, traditional feature extraction and classification methods struggle to achieve satisfactory results. The SAE performs non-linear transformations through its hidden layers, which allows it to learn complex relationships. It has a strong ability to represent high-order features and can extract more complex features from metabolomics data. Experimental results on real metabolomics data of dilated cardiomyopathy demonstrate that the proposed model performs better than other existing algorithms.
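
As a rough illustration of two components named in item (2), the sketch below applies dropout to a low-dimensional code and passes it through a softmax layer. It is a minimal sketch, not the dissertation's implementation: the inverted-dropout rescaling, the 4-dimensional code, the weight values, and the function names `dropout` and `classify` are all assumptions made for the example, and the DBN pre-training and L-BFGS fine-tuning stages are omitted.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(features, rate, rng, training=True):
    """Inverted dropout (an assumed variant): zero each unit with probability
    `rate` and rescale survivors by 1 / (1 - rate); a no-op at inference."""
    if not training or rate == 0.0:
        return list(features)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def classify(features, weights, biases):
    """Softmax layer: one weight row and one bias per class."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Hypothetical 4-dimensional code, standing in for the output of a
# dimensionality-reduction stage; all numbers are illustrative only.
hidden = [0.0, 1.2, 0.0, 0.7]
weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.0, -0.1]]
biases = [0.0, 0.1]

rng = random.Random(0)
# Training-time forward pass: some units are randomly dropped.
train_probs = classify(dropout(hidden, 0.5, rng), weights, biases)
# Inference-time forward pass: dropout is disabled.
test_probs = classify(dropout(hidden, 0.5, rng, training=False), weights, biases)
```

Rescaling the surviving units at training time keeps the expected input to the softmax layer the same with and without dropout, so no correction is needed at inference.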
Keywords/Search Tags:deep belief networks, stacked auto encoder, text data, metabolomics data, high-dimensional and sparse