
Research on Association Rule Mining for Cancer Gene Data

Posted on: 2020-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: G J Ma
Full Text: PDF
GTID: 2404330578956073
Subject: Communication and Information System
Abstract/Summary:
The continuous development of science and technology has driven rapid change in all walks of life, especially in biology. The success of whole-genome sequencing has greatly reduced the cost of obtaining cancer gene expression data and has provided a broad platform for systematic research on cancer genomes. Cancer gene expression data are high-dimensional, have small sample sizes, and are noisy, so how to mine valuable information from such datasets is a hot topic. Researchers around the world have made some progress in studying cancer gene expression data. However, because the technical route is still immature, research on gene expression data is not yet thorough enough, and existing results cannot yet be used in clinical medicine; they still require large-scale validation.
Association rule mining is one of the most practical data mining methods. The Apriori algorithm is one of the classical association rule mining algorithms, but its defects are obvious. For high-dimensional big data, Apriori must scan the database repeatedly when generating candidates and frequent itemsets. The candidates not only occupy storage space but also contain many irrelevant items, which reduces the accuracy of the algorithm and increases its running time. This thesis therefore proposes a new association rule mining algorithm based on PmR-NRS hybrid feature selection. The algorithm uses the PmR-NRS method to maximize the correlation between features and categories in the dataset while minimizing the redundancy among features. The features are extracted, and the optimized feature subset is retained for mining association rules. To verify the effectiveness of the proposed algorithm, the traditional Apriori algorithm and the improved Apriori algorithm are applied to four cancer gene expression datasets. The results show that PmR-NRS hybrid feature selection substantially benefits association rule mining: it reduces the computational complexity of the Apriori algorithm and improves its effectiveness.
For association clustering, we mainly study the WPFCM algorithm. This algorithm is only suitable for low-dimensional datasets, whereas the cancer gene datasets studied here have more than 20,000 dimensions. We therefore propose an improved algorithm, QR-WPFCM, whose main idea is to decompose the high-dimensional data and then perform cluster analysis. To verify the accuracy of QR-WPFCM, we chose two classic cancer datasets: the leukaemia dataset published by Golub and the colon carcinoma dataset from the GEO gene database. The experiments indicate that the accuracy of QR-WPFCM can reach 100% after appropriate cluster centers are selected, while the accuracy of the traditional WPFCM algorithm reaches only 93.1%. Finally, association rules are mined from the clustered cancer datasets, and the results show that the QR-WPFCM clustering association rule algorithm has great potential and applicability for predicting cancer gene markers.
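As an illustration of why Apriori scans the database repeatedly, and why restricting the items to a selected feature subset shrinks the candidate set, the following is a minimal sketch of textbook level-wise Apriori in Python. It is not the thesis's implementation. The item labels (`G1_high`, `tumor`, and so on), the `keep` parameter, and the helper `confidence` are hypothetical names introduced here; `keep` only stands in for the output of some feature selection step and does not implement PmR-NRS.

```python
from itertools import combinations


def apriori(transactions, min_support, keep=None):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    keep: optional set of items to retain (a hypothetical stand-in for a
    feature subset chosen by a separate feature selection step).
    """
    transactions = [frozenset(t) for t in transactions]
    if keep is not None:
        keep = frozenset(keep)
        transactions = [t & keep for t in transactions]
    n = len(transactions)

    def support(itemset):
        # One pass over the whole database for every candidate itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unite frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Toy data: each sample is a set of discretized expression labels plus a class.
samples = [
    {"G1_high", "G2_high", "tumor"},
    {"G1_high", "G2_high", "tumor"},
    {"G1_high", "G3_low", "tumor"},
    {"G2_high", "G3_low", "normal"},
]
all_rules = apriori(samples, min_support=0.5)
reduced = apriori(samples, min_support=0.5, keep={"G1_high", "tumor"})
```

On this toy data the unrestricted run enumerates more itemsets than the run limited to two retained items, which is the effect the abstract attributes to feature selection; with tens of thousands of gene features the difference in candidate count would be far larger.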
Keywords/Search Tags:Cancer Gene Expression Data, Apriori Algorithm, Feature Selection, Association Clustering, Genetic Markers