Research And Implementation Of Health Big Data Preprocessing Methods

Posted on: 2019-03-12  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Chen  Full Text: PDF
GTID: 2348330569995540  Subject: Engineering
Abstract/Summary:
With the development of computer science and information technology, human society has gradually entered the era of the Internet and big data. In the medical and health industry, big data technology makes it possible to integrate and reallocate existing resources, improving operational efficiency and tapping the industry's large potential. However, modern medical data is massive, high-dimensional, structurally complex, and informationally intricate, which makes direct analysis of health data difficult. Preprocessing can improve the quality of health datasets and reduce their size, which in turn improves the efficiency and accuracy of data analysis. Building on existing preprocessing techniques, this thesis analyzes and improves algorithms for three preprocessing tasks on health datasets: duplicate data cleaning, outlier detection, and data reduction. The main work is as follows:

(1) Research and improvement of duplicate data cleaning for health big data. Existing duplicate-cleaning techniques are first analyzed, and a preprocessing scheme suited to health datasets is proposed. The thesis focuses on the structure and characteristics of the prefix tree, improves it according to the characteristics of medical data, and applies the improved prefix tree to duplicate cleaning of medical data. Traditional algorithms suffer from low detection accuracy and low execution efficiency when the data volume is large. Duplicate cleaning based on the improved prefix tree addresses this problem, and its advantage over traditional algorithms becomes more pronounced as the dataset grows.

(2) Research and improvement of outlier detection for health big data. Existing outlier detection algorithms are studied, with a focus on density-based outlier detection. Because density-based detection is insensitive to global outliers and does not scale with the rapid growth of health datasets, the thesis proposes a global outlier (isolated point) detection algorithm based on a voting strategy, together with an improved algorithm that reduces time complexity by introducing clustering. Experiments show that the improved algorithm handles health datasets better, improving both execution efficiency and the comprehensiveness of outlier detection.

(3) Research and improvement of data reduction for health data. The thesis studies feature selection algorithms, with a focus on feature selection based on random forests. When computing feature importance, the original algorithm ignores the relationship among the importance values that each feature takes in individual trees, so the thesis proposes a method for computing feature importance based on local importance. Experimental analysis and comparison show that the improved algorithm selects better feature subsets and improves the performance of the classification model.

(4) Design and implementation of a health dataset preprocessing system. The improved preprocessing algorithms are applied to a liver disease dataset to demonstrate their effectiveness more intuitively. Experiments show that the improved preprocessing algorithms improve the quality of the health dataset more effectively and further improve the performance of the data analysis model.
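The prefix-tree idea in item (1) can be illustrated with a minimal sketch. It is not the improved prefix tree of the thesis, whose modifications the abstract does not describe. It only shows how a plain trie keyed on normalized field values flags exact duplicates in a single pass. The names TrieNode, normalize, and find_duplicates, the normalization rule, and the sample records are all assumptions made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


class TrieNode:
    """One node of the prefix tree; children are keyed by a normalized field value."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.first_index: int = -1  # index of the first record ending here, -1 if none


def normalize(value: str) -> str:
    """Hypothetical normalization: trim, lowercase, collapse inner whitespace."""
    return " ".join(value.strip().lower().split())


def find_duplicates(records: Iterable[Tuple[str, ...]]) -> List[Tuple[int, int]]:
    """Return (duplicate_index, first_index) pairs for records whose
    normalized fields all match an earlier record."""
    root = TrieNode()
    duplicates: List[Tuple[int, int]] = []
    for index, record in enumerate(records):
        node = root
        # Walk (and extend) the trie one field at a time.
        for field in record:
            key = normalize(field)
            if key not in node.children:
                node.children[key] = TrieNode()
            node = node.children[key]
        if node.first_index >= 0:
            duplicates.append((index, node.first_index))
        else:
            node.first_index = index
    return duplicates


# Made-up sample records (name, date of birth, diagnosis), for illustration only.
records = [
    ("Patient A", "1980-02-11", "hepatitis B"),
    ("patient  a", "1980-02-11", "Hepatitis B"),
    ("Patient B", "1975-07-30", "fatty liver"),
]
print(find_duplicates(records))  # prints [(1, 0)]
```

Each record is inserted in time proportional to its number of fields, so records are never compared pairwise. This is one plausible reason a trie-based approach scales better than pairwise matching as the dataset grows.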
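For item (2), the following sketch shows a baseline density-based detector in the style of the local outlier factor (LOF), the kind of algorithm the thesis sets out to improve. It does not implement the voting strategy or the clustering speed-up, which the abstract does not specify. The function names, the choice k = 2, and the sample points are assumptions, and the neighbourhood is simplified to exactly k neighbours, ignoring distance ties.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def k_nearest(points: List[Point], i: int, k: int) -> List[int]:
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def lof_scores(points: List[Point], k: int = 2) -> List[float]:
    """Simplified LOF: scores near 1 mean 'as dense as the neighbours',
    scores well above 1 mean 'much sparser than the neighbours'."""
    n = len(points)
    neighbours = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [math.dist(points[i], points[neighbours[i][-1]]) for i in range(n)]

    def reach_dist(a: int, b: int) -> float:
        # Reachability distance of a with respect to b.
        return max(k_dist[b], math.dist(points[a], points[b]))

    # Local reachability density; the floor avoids division by zero
    # when a point coincides with all of its neighbours.
    lrd = []
    for i in range(n):
        total = sum(reach_dist(i, j) for j in neighbours[i])
        lrd.append(len(neighbours[i]) / max(total, 1e-12))

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] / lrd[i] for j in neighbours[i]) / len(neighbours[i])
        for i in range(n)
    ]


# Made-up 2-D points: a tight square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
print(scores.index(max(scores)))  # prints 4, the index of the distant point
```

This version computes all pairwise distances, so its cost grows quadratically with the number of points. That cost is the kind of scaling problem that motivates reducing time complexity through clustering.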
Keywords/Search Tags:health data, preprocessing, duplicate data cleaning, isolated point detection, feature selection