Study On Missing Values Imputation And Variables Classification In Metabolomic Data Cleaning

Posted on: 2020-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: J H Qin
GTID: 2370330572980690
Subject: Electronics and Communications Engineering
Abstract:
Data cleaning is an important step before metabolomics data analysis. A reasonable data cleaning step makes the subsequent analysis more complete and effective. This thesis proposes new methods for two data cleaning problems, missing value processing and variable classification.

1. Missing values. The thesis discusses the missing value patterns in mass spectrometry data in depth and proposes a new method for generating simulated mass spectrometry data sets with missing values; it is simpler and faster than existing generation methods, and more realistic. It also proposes a new missing value imputation method: within the KNN procedure, group information and the properties of the left-truncated normal distribution are used to fill the values that are missing not at random first, and adaptive KNN then fills the remaining missing values. In the simulated metabolomics analysis, the improved KNN method fills missing values effectively and also benefits the subsequent analysis.

2. Variable classification. The thesis proposes a new variable classification method, the D-C method. It classifies variables from two directions: the correlation between independent and dependent variables, and the correlation among the variables themselves. Variables are divided into three categories: variables with unique characteristics, variables with practical meaning and common variables, and redundant variables. The D-C method mainly uses the Diffreg method, the CMELR-CSIS method, principal component analysis, and correlation analysis to complete the classification. The method is also applied to the data cleaning step of multi-source data processing, where it handles high-dimensional data efficiently. Results on simulated and real data show that the method helps subsequent modeling and other procedures.

Applying these two methods to the cleaning of mass spectrometry-based metabolomics data benefits the downstream analysis and provides a new approach to metabolomics data preprocessing.
Keywords/Search Tags: Metabolomics, Data Cleaning, Missing Values, Variable Classification
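The two-stage imputation idea described in the abstract (fill values that are missing not at random first, then apply KNN to the rest) can be illustrated with the rough sketch below. It is not the thesis's algorithm. Several pieces are assumptions made only for illustration: the function name `two_stage_impute`, the 50% per-group missingness threshold used to flag a value as missing not at random, the half-minimum fill that stands in for the left-truncated normal step, and plain (non-adaptive) KNN over samples.

```python
import numpy as np


def two_stage_impute(X, groups, k=2, group_missing_threshold=0.5):
    """Illustrative two-stage imputation (hypothetical helper, not the thesis code).

    X is a samples-by-features array with NaN for missing values;
    groups holds one group label per sample.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    filled = X.copy()

    # Stage 1 (assumption): if a feature is missing in more than
    # `group_missing_threshold` of a group's samples, treat it as below the
    # detection limit and fill with half of the feature's observed minimum.
    # This is a simple stand-in for the left-truncated normal step.
    for j in range(X.shape[1]):
        observed = X[~np.isnan(X[:, j]), j]
        if observed.size == 0:
            continue  # nothing observed for this feature, leave it as NaN
        low_value = observed.min() / 2.0
        for g in np.unique(groups):
            in_group = groups == g
            missing = in_group & np.isnan(X[:, j])
            if missing.sum() / in_group.sum() > group_missing_threshold:
                filled[missing, j] = low_value

    # Stage 2: plain KNN over samples for whatever is still missing.
    # Distances use only the features observed in both samples.
    stage1 = filled.copy()
    for i, j in zip(*np.where(np.isnan(stage1))):
        candidates = []
        for r in range(stage1.shape[0]):
            if r == i or np.isnan(stage1[r, j]):
                continue
            shared = ~np.isnan(stage1[i]) & ~np.isnan(stage1[r])
            if shared.any():
                diff = stage1[i, shared] - stage1[r, shared]
                candidates.append((np.sqrt(np.mean(diff ** 2)), stage1[r, j]))
        if candidates:
            candidates.sort(key=lambda c: c[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled


# Toy data: feature 0 is entirely missing in group "B" (handled in stage 1),
# while one scattered gap in feature 2 is left for KNN (stage 2).
X_demo = np.array([
    [1.0, 10.0, 5.0],
    [1.2, 11.0, np.nan],
    [0.9, 10.5, 5.2],
    [np.nan, 20.0, 9.0],
    [np.nan, 21.0, 9.5],
    [np.nan, 19.0, 9.1],
])
groups_demo = ["A", "A", "A", "B", "B", "B"]
imputed_demo = two_stage_impute(X_demo, groups_demo, k=2)
```

Filling the group-wide gaps before running KNN keeps low-abundance values from being averaged up toward the levels seen in other groups, which is the motivation the abstract gives for ordering the two stages this way.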