Research And Application Of Correlation Algorithm For Massive Data

Posted on: 2020-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Xu
GTID: 2430330575955708
Subject: Computer technology
Abstract/Summary:
Data is growing faster than ever before; by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet. Detecting relationships among variables in large data sets is becoming increasingly common in genomics, physics, political science, and economics, which makes finding such relationships a growing challenge. In general, given a large enough sample, we want to detect a wide range of relationships between variables: not only specific functional forms (such as linear ones), but all functional relationships. We also want a measure that assigns similar scores to relationships of different types that carry the same level of noise. In the study of genetic data, detecting the genes that cause a disease plays an important role and is a key problem. However, most existing manual methods are time-consuming and costly, so it is desirable to detect pathogenic genes by computational means instead. Traditional computational methods, in turn, perform poorly at detecting nonlinear relationships. This thesis proposes a new solution for each of these two problems.

The main research results of this thesis are:

1. The Maximum Information Coefficient (MIC) is an effective tool for exploring relationships in data. MIC exhaustively considers all partitions when dividing the variables into grids, and this search dominates the computational cost of the MIC algorithm on large data sets. We propose a new approximation algorithm, called CDMIC (the maximum information coefficient using Cluster Division), that significantly improves the applicability of MIC to big data sets. First, a fast clustering method generates central nodes for similar data, and each central node represents its closely related points, forming clusters. Second, the MIC value is calculated for each region. Finally, the weighted sum of these MIC values is used as the result. The experimental results show that CDMIC retains the advantages of MIC and accurately identifies related data pairs, while being far superior to the MIC algorithm in time efficiency. This method can be used for the detection of pathogenic genes.

2. The traditional method for detecting pathogenic genes is linear regression, but linear regression detects nonlinear relationships poorly, while nonlinear regression takes a long time. This thesis combines the advantages of the two methods: it proposes a stepwise nonlinear regression model based on bagging and uses the LARS (least angle regression) algorithm to obtain results quickly. The correctness and time efficiency of the algorithm are then verified. The experimental results show that, compared with the linear regression model used in the traditional method, the proposed bagging-based stepwise nonlinear regression model performs better on genetic data.
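The three CDMIC steps named in result 1 (cluster the data, score each region, combine the scores with weights) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not specify the clustering method, the per-region estimator, or the weights. The sketch assumes plain k-means on the (x, y) points, weights proportional to cluster size, and a crude equal-frequency grid score (`approx_mic`) as a stand-in for the real MIC partition search. All function names and the parameters `k` and `alpha` are hypothetical.

```python
import math
import random
from collections import Counter


def equal_freq_bins(values, k):
    """Assign each value to one of k equal-frequency bins by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def mutual_information(bx, by):
    """Mutual information (in bits) of two discrete label sequences."""
    n = len(bx)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def approx_mic(xs, ys, alpha=0.6):
    """Crude MIC stand-in (assumption): best normalized MI over
    equal-frequency grids with kx * ky <= n ** alpha. The real MIC
    also optimizes the bin boundaries, which this sketch does not."""
    n = len(xs)
    if n < 4 or len(set(xs)) < 2 or len(set(ys)) < 2:
        return 0.0
    budget = max(4, int(n ** alpha))
    best = 0.0
    for kx in range(2, budget // 2 + 1):
        bx = equal_freq_bins(xs, kx)
        for ky in range(2, budget // kx + 1):
            mi = mutual_information(bx, equal_freq_bins(ys, ky))
            best = max(best, mi / math.log2(min(kx, ky)))
    return min(best, 1.0)


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: (p[0] - centers[c][0]) ** 2
                      + (p[1] - centers[c][1]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels


def cdmic(xs, ys, k=4):
    """Cluster, score each cluster, return the size-weighted sum."""
    points = list(zip(xs, ys))
    labels = kmeans(points, k)
    n = len(points)
    score = 0.0
    for c in range(k):
        cx = [x for x, lab in zip(xs, labels) if lab == c]
        cy = [y for y, lab in zip(ys, labels) if lab == c]
        score += (len(cx) / n) * approx_mic(cx, cy)
    return score
```

Calling `cdmic(xs, ys)` on two equal-length lists of numbers returns a score between 0 and 1. Because each cluster is scored separately, every grid search runs on far fewer points than a search over the whole data set, which is the kind of saving the abstract attributes to CDMIC.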
Keywords/Search Tags:Correlation coefficient, maximum information coefficient, linear correlation, nonlinear correlation, least angle regression