Research And Implementation Of RCNA Identification Based On K-means Clustering

Posted on: 2016-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhao
GTID: 2348330488474523
Subject: Computer application technology
Abstract/Summary:
Gene copy number is the number of copies of a particular gene or DNA sequence in a given region of an organism's genome. Copy number variation is a form of structural variation in which, relative to the reference genome, DNA segments ranging from 1 kb to 1 Mb are deleted or gained. Copy number aberrations (CNAs) are a kind of structural variation that is ubiquitous in the genome. They include deletion, insertion, inversion, and rearrangement of gene copies, and they are more complex than point mutations. Studying CNAs offers a new view of genome structure, of genetic differences among humans, and of the genetic factors behind disease. A recurrent CNA (RCNA) is a contiguous CNA that occurs in the same chromosomal region across multiple samples, and RCNAs are associated with many diseases. Identifying RCNAs can therefore provide important insights for studying the molecular mechanisms of disease genes.

This thesis aims to mine disease-associated RCNA regions from high-throughput biological data and to compute and evaluate the mined regions, providing a basis for the study of pathogenic RCNA regions.

Analysis of RCNA regions shows that the genes within them exhibit clustering behavior. Based on this property, we propose an RCNA identification algorithm based on k-means clustering, in which the RCNA region is treated as one class and the remaining data as another. Because the raw data are noisy, we first apply Wiener filtering to remove the noise and then analyze the denoised data. The analysis starts at the first column: a window of the specified width is selected and clustered with k-means. The starting position of the window is then advanced by one column, and the data in the new window are analyzed in the same way. To make the results more accurate, k-means clustering is run several times on each selected window, and the run with the minimum distance from the sample points to their cluster centers is kept. Analyzing the cluster centers of these minimum-distance results makes it possible to identify the RCNA regions present in the data.

All experiments in this thesis were performed on simulated data sets, and they verify the feasibility of the algorithm. Comparison with two existing RCNA identification algorithms shows that the proposed algorithm performs better at identifying RCNAs.
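The pipeline described in the abstract (denoise, slide a window one column at a time, run k-means several times per window, keep the minimum-distance run) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation. The function names, the local-mean/variance form of the Wiener filter, the choice to treat each sample's window segment as one point, the "gap between centers" score, and the toy data are all hypothetical choices made for this example.

```python
import random


def wiener_1d(signal, size=5):
    """Adaptive Wiener-style smoothing from local mean and variance."""
    n, half = len(signal), size // 2
    means, variances = [], []
    for i in range(n):
        win = signal[max(0, i - half):min(n, i + half + 1)]
        m = sum(win) / len(win)
        means.append(m)
        variances.append(sum((v - m) ** 2 for v in win) / len(win))
    noise = sum(variances) / n  # assumption: noise power = mean local variance
    out = []
    for x, m, v in zip(signal, means, variances):
        # Flat neighbourhoods collapse to the local mean; high-variance ones keep detail.
        out.append(m if v <= noise else m + (1 - noise / v) * (x - m))
    return out


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm; returns (labels, centers, total distance)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    total = sum(sq_dist(p, centers[lab]) ** 0.5 for p, lab in zip(points, labels))
    return labels, centers, total


def scan_windows(data, width, restarts=5, seed=0):
    """Slide a window one column at a time; keep the best of several k-means runs."""
    rng = random.Random(seed)
    scores = []
    for start in range(len(data[0]) - width + 1):
        # Assumption: each sample's segment inside the window is one point, k = 2.
        points = [row[start:start + width] for row in data]
        runs = [kmeans(points, 2, rng) for _ in range(restarts)]
        _, centers, _ = min(runs, key=lambda r: r[2])  # minimum-distance run
        # Hypothetical score: gap between the mean levels of the two centers.
        gap = abs(sum(centers[0]) - sum(centers[1])) / width
        scores.append((start, gap))
    return scores


def simulate(n_samples=20, n_probes=60, region=(20, 36), seed=1):
    """Toy data: half the samples carry a gain of +1 inside `region`."""
    rng = random.Random(seed)
    data = []
    for s in range(n_samples):
        row = [rng.gauss(0.0, 0.2) for _ in range(n_probes)]
        if s < n_samples // 2:
            for j in range(*region):
                row[j] += 1.0
        data.append(row)
    return data


data = [wiener_1d(row) for row in simulate()]
scores = scan_windows(data, width=5)
best_start, best_gap = max(scores, key=lambda s: s[1])
print(f"strongest window starts at column {best_start} (gap {best_gap:.2f})")
```

On this toy data, windows that overlap the simulated aberration should show a much larger gap between the two cluster centers than windows over background noise, which is one plausible reading of "analyzing the cluster centers" in the abstract. A production version would more likely rely on an established library for the filtering and clustering steps.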
Keywords/Search Tags: copy number aberration, recurrent copy number aberration, k-means clustering