Study On Missing Values Imputation And Variables Classification In Metabolomic Data Cleaning

Posted on:2020-11-06

Degree:Master

Type:Thesis

Country:China

Candidate:J H Qin

Full Text:PDF

GTID:2370330572980690

Subject:Electronics and Communications Engineering

Abstract/Summary:

Data cleaning is an important step before metabolomics data analysis. A sound data cleaning step makes the subsequent analysis more complete and effective. This thesis proposes new methods for two data cleaning problems, missing value handling and variable classification:

1. The missing value patterns in mass spectrometry data are examined in depth, and a new method is proposed for generating simulated mass spectrometry data sets with missing values. It is simpler and faster than existing generation methods and produces more realistic data. A new missing value imputation method is also proposed: within the KNN procedure, group information and the properties of the left-truncated normal distribution are used to fill the values that are missing not at random first, after which adaptive KNN fills the remaining missing values. In the simulated metabolomics analysis, the improved KNN method imputes missing values effectively and also benefits the subsequent analysis.

2. A new variable classification method, the D-C method, is proposed. It classifies variables from two directions: the correlation between independent and dependent variables, and the correlation among the variables themselves. Variables are divided into three categories: variables with unique characteristics, variables with practical meaning and common variables, and redundant variables. The D-C method mainly uses the Diffreg method, the CMELR-CSIS method, principal component analysis, and correlation analysis to carry out the classification. The method is also applied to the data cleaning step of multi-source data processing, where it handles high-dimensional data efficiently. Results on simulated and real data show that the method helps subsequent modeling and other downstream procedures.

Applied to the cleaning of mass spectrometry-based metabolomics data, the two methods above support the downstream analysis and provide a new approach to metabolomics data preprocessing.
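The two-stage imputation idea described in the abstract (fill values that are missing not at random first, then fill the rest with KNN) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function name `two_stage_impute`, the `mnar_fraction` threshold, the half-minimum fill (a simple stand-in for a draw from a left-truncated normal distribution), and the plain fixed-k KNN (in place of adaptive KNN) are all assumptions made for illustration.

```python
import numpy as np

def two_stage_impute(X, groups, k=3, mnar_fraction=0.5):
    """Hypothetical sketch. X: samples x features, np.nan marks missing values.
    groups: one group label per sample."""
    X = np.array(X, dtype=float)
    groups = np.asarray(groups)
    filled = X.copy()

    # Stage 1 (assumed rule): if a feature is mostly missing within a group,
    # treat those values as below the detection limit and fill them with
    # half of the feature's smallest observed value.
    for g in np.unique(groups):
        rows = np.where(groups == g)[0]
        for j in range(X.shape[1]):
            miss = np.isnan(X[rows, j])
            if miss.mean() > mnar_fraction and not np.all(np.isnan(X[:, j])):
                filled[rows[miss], j] = np.nanmin(X[:, j]) / 2.0

    # Stage 2: plain KNN over samples for the remaining missing values.
    result = filled.copy()
    for i, j in zip(*np.where(np.isnan(filled))):
        donors = np.where(~np.isnan(filled[:, j]))[0]
        if donors.size == 0:
            continue  # nothing to borrow from; leave the value missing
        dists = []
        for d in donors:
            # Compare only on features observed in both samples.
            shared = ~np.isnan(filled[i]) & ~np.isnan(filled[d])
            if shared.any():
                diff = filled[i, shared] - filled[d, shared]
                dists.append(np.sqrt(np.mean(diff ** 2)))
            else:
                dists.append(np.inf)
        nearest = donors[np.argsort(dists)[:k]]
        result[i, j] = filled[nearest, j].mean()
    return result

# Toy data: feature 0 is entirely missing in group 1 (handled in stage 1);
# the two scattered gaps are left to the KNN stage.
X_demo = [[10, 5, 1],
          [11, 6, np.nan],
          [12, np.nan, 2],
          [np.nan, 5, 1],
          [np.nan, 6, 2],
          [np.nan, 7, 3]]
groups_demo = [0, 0, 0, 1, 1, 1]
result = two_stage_impute(X_demo, groups_demo)
```

Handling the group-wise gaps first keeps the KNN stage from averaging high-abundance neighbours into entries that are more plausibly below the detection limit.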

Keywords/Search Tags:

Metabolomics, Data Cleaning, Missing Values, Variable Classification

Related items

1	Classification Of Metabolomics Data And Study Of Variable Selection Methods
2	Research On Classification Method Of Time Series With Massive Missing Data
3	Data mining applications for updating missing values of traffic counts
4	Computational tools for missing values in multivariate longitudinal and clustered data
5	Visualizing gridded data sets with large number of missing values
6	Imputation Methods Of Missing Values For Compositional Data
7	Expectation Estimator In Missing Data
8	Impute Missing Values For Mixed Data
9	Robust model-based analysis of multivariate data with missing values
10	Research On Data Cleaning And Fusion Techniques Of Multi-source Heterogeneous POI Data