Biclustering Analysis On Loss Of Heterozygosity Data Of Lung Carcinomas Samples

Posted on: 2008-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: L S Liu
GTID: 2144360212996832
Subject: Computer system architecture
Abstract/Summary:
With the completion of the Human Genome Project (HGP) and the HapMap project, research that uses the human genome to prevent and cure disease is advancing. Human lung carcinoma is a deadly disease with a high mortality rate. This thesis applies biclustering analysis to samples of different kinds of lung carcinoma together with normal samples, providing a new method and approach for genomic research on this cancer.

In this thesis, the concepts and research significance of SNPs (single nucleotide polymorphisms) and LOH (loss of heterozygosity) are first introduced from the viewpoint of bioinformatics. The whole work is built on this basis: SNP chips are taken as the objects of study, dChip software is used to extract LOH data, and the data are then used directly to look for lung carcinoma tumor suppressor genes and for people who are prone to this cancer. The biclustering method adopted in this thesis is then presented. It is a data mining method that was put forward in 2000 for gene expression data. Since then, dozens of biclustering methods have been developed, differing in bicluster type, structure, algorithm, and complexity. They are classified as a whole to make clear how biclustering approaches its problems. The Cheng & Church algorithm is implemented in this thesis, and its concept, theory, implementation details, and flow chart are then explained.

Finding loci that show LOH is an important means of locating tumor suppressor genes. Defects in these genotypes lead to various kinds of lung carcinoma at the phenotype level. Research on LOH currently relies on the short tandem repeat (STR), a second-generation genetic marker. It detects changes in chromosomes or genes morphologically and has the disadvantages of a high mutation rate and low accuracy.
This thesis takes SNP loci as the row vectors of the array to be analyzed, which identifies tumor genetic variation at the molecular level. A SNP is a variation of a single nucleotide at a given position in the DNA sequence that occurs in more than 1 percent of a population. SNPs are the most abundant form of human genetic variation. They have the merits of large numbers, high consistency, wide distribution, and a low mutation rate, and they have replaced STRs as the third-generation genetic marker. With the completion of the HapMap project and the maturing of gene chip technology, it has become possible to analyze SNP chips instead of gene expression chips, which can make the research more efficient and effective.

The research method in this thesis is biclustering, which offers a new view of extracting information from expression chips. It treats the rows and columns of the array together and clusters them simultaneously to obtain a sub-array, that is, a subset containing both rows and columns. It is a clustering pattern aimed at biological chips. It overcomes problems of traditional clustering methods such as noisy signals, forced partitioning of interrelated biological systems, and the loss of much local information. The array in this thesis is the result of sample expression, calculated with dChip software and then trimmed. The rows are 62982 SNP loci, the columns are 113 samples, and the entries are scores calculated with the hidden Markov model (HMM) in dChip. The 62982 SNP loci cover the 23 pairs of human chromosomes, and the 113 samples comprise 101 lung carcinoma patients and 12 normal individuals. Scores range from 0 to 1, with 0.5 as the boundary for judging whether there is LOH.
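As a minimal sketch of how such a score array could be turned into LOH calls, the Python snippet below thresholds a small made-up array at 0.5. The array values, the function name `loh_calls`, and the choice to count a score of exactly 0.5 as LOH are assumptions made for illustration only; the real array has 62982 rows and 113 columns and is produced by dChip.

```python
import numpy as np

def loh_calls(scores, threshold=0.5):
    """Return a boolean array: True where a locus/sample entry is called LOH.

    Assumption: scores at or above the threshold count as LOH.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.min() < 0.0 or scores.max() > 1.0:
        raise ValueError("LOH scores are expected to lie in [0, 1]")
    return scores >= threshold

# Hypothetical toy array: 4 SNP loci (rows) x 3 samples (columns).
toy_scores = np.array([
    [0.90, 0.10, 0.80],
    [0.20, 0.30, 0.40],
    [0.70, 0.95, 0.60],
    [0.05, 0.55, 0.15],
])

calls = loh_calls(toy_scores)
# Fraction of loci called LOH in each sample (column).
loh_fraction_per_sample = calls.mean(axis=0)
```

Keeping the calls as a boolean array separate from the scores leaves the original HMM scores available for the biclustering step.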
To highlight the LOH loci, the entries at other loci (whose scores are under 0.5) are randomized, which shows the pattern of the LOH loci more clearly.

The biclustering algorithm introduces the concept of the "residue score" and divides every entry value into three independent parts: a background value, a row effect, and a column effect. The residue of an entry is formed from its score value and these three parts. The offset H is the score function defined as the mean squared residue. The algorithm tries to find sub-arrays with a small offset. A sub-array with zero offset has exactly the same trend across its expression pattern, so the smaller the H value, the more coherent the expression pattern.

The Cheng & Church algorithm uses greedy search to find sub-arrays with low H values. It begins by initializing a cluster that contains all the rows and columns of the data set. It then iteratively deletes rows or columns from the cluster to reduce H. When the H of the cluster falls below the threshold delta (δ), the cluster is the result. Concretely, single node deletion finds the row or column with the maximal H value and deletes it, then recomputes the scores of the whole array and checks again, until the H value of the target array is below the threshold. To improve efficiency, multiple node deletion removes a set of rows or columns at once instead of one at a time: it deletes the rows and columns with large H values and does not recompute during the deletion. This thesis combines multiple node deletion and single node deletion to increase efficiency.

To make the result as large as possible, rows or columns with small H values relative to the result of the first phase are chosen and added to the result array. This thesis does not tentatively add all the rows, only the rows that satisfy certain conditions. As for the columns, a column is added only if the H value of the new array is not larger than that of the original one.
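The residue score and the single node deletion step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names and the toy matrix are hypothetical, the residue is taken in the standard Cheng & Church form (entry minus row mean minus column mean plus overall mean), and multiple node deletion and node addition are left out.

```python
import numpy as np

def residues(sub):
    """Residue of each entry: a_ij - rowmean_i - colmean_j + overall mean."""
    row_mean = sub.mean(axis=1, keepdims=True)
    col_mean = sub.mean(axis=0, keepdims=True)
    return sub - row_mean - col_mean + sub.mean()

def mean_squared_residue(sub):
    """Offset H of a sub-array: the mean of the squared residues."""
    return float((residues(sub) ** 2).mean())

def single_node_deletion(data, delta):
    """Greedily drop the worst row or column until H is below delta."""
    data = np.asarray(data, dtype=float)
    rows = np.arange(data.shape[0])
    cols = np.arange(data.shape[1])
    while len(rows) > 1 and len(cols) > 1:
        sq = residues(data[np.ix_(rows, cols)]) ** 2
        if sq.mean() < delta:
            break
        row_scores = sq.mean(axis=1)  # per-row mean squared residue
        col_scores = sq.mean(axis=0)  # per-column mean squared residue
        r, c = row_scores.argmax(), col_scores.argmax()
        if row_scores[r] >= col_scores[c]:
            rows = np.delete(rows, r)
        else:
            cols = np.delete(cols, c)
    return rows, cols

# Hypothetical toy matrix: rows 0-3 are purely additive (row effect plus
# column effect), row 4 breaks the pattern.
toy = np.array([
    [0.0, 10.0, 20.0],
    [1.0, 11.0, 21.0],
    [2.0, 12.0, 22.0],
    [3.0, 13.0, 23.0],
    [5.0, -7.0, 40.0],
])
rows, cols = single_node_deletion(toy, delta=1e-9)
```

On this toy matrix the non-additive row has the largest per-row score, so it is the one removed, and the remaining additive block has zero offset.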
This process is repeated, ensuring only that H(I,J) < δ. In this way the goal of the algorithm is achieved: finding the largest sub-array among all sub-arrays whose H value is lower than δ.

The main work of this thesis is twofold. The first part is obtaining the scores of 113 samples at 62982 SNP loci with dChip software, analyzing them, and trimming them into one large array. The second is implementing the Cheng & Church algorithm and applying it to that array. This establishes a new method for lung carcinoma LOH research and makes it possible to find the samples and SNP loci associated with susceptibility to the disease. Finally, the results are compared with the actual characteristics of the samples and shown to be effective.
Keywords/Search Tags:Heterozygosity