| Mankind has entered the post-genome era, and clarifying the interactions and relationships between genes has rapidly become a research hotspot in contemporary life sciences. Studying gene interactions and inferring gene regulatory networks is an important goal of genomics. Whole-genome sequencing has placed vast amounts of DNA sequence information before us. Parsing out all possible coding genes and their physiological functions, and analyzing massive gene expression data at the genome-wide level so as to reveal the mysteries of life, will be among the most challenging biological problems of the post-genome era. Microarray analysis can detect changes in gene transcription under different conditions, yielding expression profiles that reflect tissue type, developmental stage, response to environmental conditions, and genetic alteration. As chip data appeared in large quantities, new questions arose: If all available data are pooled, can genes of unknown function be assigned to known functional classes? Can gene expression be linked to gene function? Can new classes of co-regulated genes be discovered? Can a complete gene regulatory network be drawn from chip expression data? Such questions are usually answered computationally. The mathematical problems posed by gene mapping and sequencing are much smaller than those posed by large-scale gene expression analysis. 
Moving from characterizing the individual components of a biological system to describing the behavior of the system as a whole, the analysis of gene expression data still requires in-depth research and development. To extract meaningful information and the interdependence between genes from massive gene expression data, and to support the construction of more complex biological networks, clustering methods have been widely applied to gene expression data analysis. Cluster analysis can group expression data by gene function, expression pattern, experimental condition, or other specific needs. It includes horizontal clustering (clustering of genes) and vertical clustering (clustering of samples): horizontal clustering groups genes with similar functions or regulatory mechanisms into the same class, while vertical clustering groups samples according to the information jointly expressed by all genes. Although cluster analysis of gene expression data has long been a hot topic in bioinformatics research at home and abroad, the particular nature of such data means that no high-performance, general-purpose clustering method has yet emerged, and clustering results still differ from the actual biological characteristics. Researchers continue to explore how to fully mine the information inherent in gene expression data. Experiments show that, for clustering algorithms based on a similarity measure, clustering quality improves markedly when the measure used is better suited to the data and captures correlation more strongly. Gene expression data differ from the data analyzed in traditional fields mainly in the following characteristics: 1. 
Because of differences in experimental design and in data collection and quantification methods, gene expression data suffer from missing values, noise, non-uniformity, and similar problems. 2. Gene expression data are usually time-series observations, and the dependency among the expression values at successive observation points should satisfy at least the first-order Markov assumption. 3. Gene expression data contain rich biological regularities. Clustering methods for expression data must therefore suit the characteristics of biological data and reflect the inherent laws of biological processes. This paper addresses the cluster analysis of time-series gene expression data. We found that a considerable number of genes in the genome show no significant expression change during regulation, whereas many other genes do show significant expression changes under regulatory mechanisms. Moreover, co-regulated gene expression includes not only co-expression (cis-expression) but also trans-expression and other forms. How to model cis-expression and trans-expression in a unified representation, and how to eliminate genes whose expression is not significant, is the first main topic of this paper. Clustering the gene expression data on the basis of this model, and automatically determining the number of clusters, is the second. Our work focuses on these two aspects. Given the particular nature of gene expression data, cis-expression and trans-expression need to be represented in a unified form. Too many irrelevant genes seriously degrade clustering quality: they dilute the features of the clustering results, reduce the differences between classes, and make the classification of genes tend toward the average. Genes whose expression is not significant are eliminated by comparing the ratios of the values in the sequence. 
This is the main content of the paper. To unify cis-expression and trans-expression in one model and to eliminate genes whose expression is not significant, we apply the following transformation to each gene expression sequence. Let O_j be the expression level of a gene at time point j. The difference between O_j and O_{j-1}, denoted ω, describes the trend of the expression value and gives a new sequence O' = (O'_1, O'_2, …, O'_{N-1}). For O' = (O'_1, O'_2, …, O'_{N-1}), we then take the difference between consecutive elements O'_j and O'_{j-1} and compute φ_j = arctan|O'_j - O'_{j-1}|: when φ_j > 1, ψ_j = 2; when φ_j = 1, ψ_j = 1; when φ_j < 1, ψ_j = 0. This yields the target sequence for cluster analysis, O* = (O*_1, O*_2, …, O*_{N-1}), with O*_i ∈ {0, 1, 2}. To eliminate genes whose expression is not significant, consider the elements of O*: let x, y, and z be the numbers of elements equal to 0, 1, and 2, respectively. Since these account for all elements of the sequence, x + y + z = N - 1. Compute the ratios r_x = x/(N-1), r_y = y/(N-1), and r_z = z/(N-1), and let R = {r_x, r_y, r_z} and R* = max R. When R* ≥ δ (δ is a real number close to 1), the gene's sequence is regarded as showing no significant expression change and is eliminated. The remaining transformed sequences are clustered with a method based on hidden Markov models. The algorithm is as follows: Step 1: Train a model on the observed sequence of each gene using the Baum-Welch algorithm, obtaining model parameters λ_j. Step 2: Calculate the similarity between every two classes λ_i and λ_j, and find the two classes with the highest similarity. Step 3: Merge the two sequences belonging to the two most similar classes λ_i and λ_j into a new sequence. 
New model parameters are obtained by training the model on the new sequence. Step 4: For every two classes λ_i and λ_j, calculate the similarity and determine whether the merging condition is met; if it is, merge the two classes and train a model for the new class. Step 5: Repeat Steps 3 and 4 until the iteration terminates and the clustering result is obtained. Cluster analysis of gene expression data remains a hot topic in bioinformatics research at home and abroad. Because of the particular nature of such data, however, no high-performance, general-purpose clustering method has yet emerged, and clustering results still differ from the actual biological properties. Improving the accuracy of cluster analysis of gene expression data, so that it fully reflects and mines the underlying biological information, calls for innovative research. |
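
The transformation and elimination step described above (differences, arctan mapping to the symbols 0, 1, 2, and the ratio test against δ) can be illustrated with a minimal Python sketch. It is not the author's implementation. The function names, the tolerance `tol` used for the φ = 1 case, and the default `delta=0.9` are assumptions introduced here for illustration. The sketch also assumes one particular reading of the text: the symbols are computed from differences of consecutive trend values, so a profile with N time points yields N - 2 symbols, and the ratios are taken over the actual symbol count rather than the N - 1 stated in the text.

```python
import math
from collections import Counter


def symbolize(levels, tol=1e-9):
    """Map expression levels O_1..O_N to a symbol sequence over {0, 1, 2}.

    Assumed reading: first differences give the trend sequence O';
    phi = arctan|O'_j - O'_{j-1}| is then thresholded at 1.
    `tol` is a hypothetical tolerance for the phi == 1 case.
    """
    # Trend sequence O': differences of consecutive expression levels.
    trend = [b - a for a, b in zip(levels, levels[1:])]
    symbols = []
    for prev, cur in zip(trend, trend[1:]):
        phi = math.atan(abs(cur - prev))
        if abs(phi - 1.0) <= tol:
            symbols.append(1)
        elif phi > 1.0:
            symbols.append(2)
        else:
            symbols.append(0)
    return symbols


def is_dominated(symbols, delta=0.9):
    """Return True when the largest symbol share R* reaches delta."""
    if not symbols:
        # Assumption: profiles too short to symbolize are treated as uninformative.
        return True
    r_star = max(Counter(symbols).values()) / len(symbols)
    return r_star >= delta


def filter_genes(profiles, delta=0.9):
    """Keep the symbol sequences of genes that pass the ratio test."""
    kept = {}
    for name, levels in profiles.items():
        symbols = symbolize(levels)
        if not is_dominated(symbols, delta):
            kept[name] = symbols
    return kept


if __name__ == "__main__":
    profiles = {
        "flat": [1, 1, 1, 1, 1, 1],
        "active": [0, 5, 0, 5, 0, 1, 2, 3, 4],
    }
    print(filter_genes(profiles))
```

In this sketch a constant profile maps entirely to the symbol 0 and is dropped, while a profile that alternates sharply and then rises steadily produces a mix of 2s and 0s and is kept. The retained symbol sequences would then serve as the observation sequences for the hidden Markov model clustering in Steps 1 to 5.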