A pathway is the set of interactions or functional relationships among a series of genetic components and a series of material components that together complete a biological process in a cell. The chemical reactions in cells are carried out under enzyme catalysis, and these enzyme-catalyzed reactions are continuous: the product of one enzyme is usually the substrate of the next. A reaction system catalyzed continuously by enzymes in this way is called a metabolic pathway. The small-molecule metabolites that act as substrates and products in metabolic pathways are the research objects of metabonomics and are also the components of metabolic networks. Signal pathways are the fundamental means of regulating cell growth, metabolism, cell differentiation and programmed cell death (apoptosis), and are a research object of proteomics. Regulatory pathways are the fundamental components of gene regulatory networks and belong to the research field of transcriptomics. Pathways are therefore the research basis of the various omics and of systems biology.

Because pathways are the research basis of the various omics, pathway prediction itself is extremely important, and breakthroughs in this area would benefit the in-depth study of many related problems.
Taking metabolic pathways as an example, research on metabolic pathway prediction is useful in several ways. It supports a better understanding of life and of the biological processes in synthetic systems. It helps to identify new genes and to verify existing annotations. If the major metabolic pathways are known, this is important for understanding the control mechanisms that dominate product synthesis, for selecting and breeding superior mutant strains, and for establishing mathematical models of microbial response rates. Finally, different pathways describe different biological processes in an organism, and this information is crucial for successfully building quantitative models of biological systems.

Although algorithmic research on applying bioinformatics methods to pathway prediction continues to deepen, it is still difficult to complete pathway prediction efficiently and accurately. The regulatory mechanisms of biological systems are complex and are not yet understood deeply enough; pathway prediction requires multidisciplinary knowledge, the combination of several methods and the integration of a variety of data; and none of the existing algorithms is perfect. New ideas and solutions for pathway prediction are therefore urgently needed. Only by analyzing and investigating the existing prediction algorithms, addressing their problems, making effective use of all available data sources, and improving prediction accuracy and sensitivity can pathway prediction become more accurate, thereby contributing to the rapid development of related research and to the continued progress of whole-genome research.

Pathway prediction mainly proceeds as follows. First, the basic elements that participate in the pathway are predicted. Then, on that basis, the relationships between the elements are determined using methods such as template matching, statistical learning and pathway structure map search. Finally, a structure map containing all the pathway information is formed.
The recognition of pathway elements is therefore one of the important parts of pathway prediction.

At present, most methods for recognizing pathway elements (such as the common template matching, BLAST comparison and statistical learning methods) take enzymes or pathways that have already been identified or are already known as their reference. Although pathway reconstruction based on known pathway information is an important starting point for explaining metabolic capability, only a limited number of pathways are known, and pathway reconstruction often misses many enzymes, even in some basic pathways. In addition, when pathway elements are mapped individually by looking for the same or similar material elements in a reference species, the literature shows that it is difficult to correctly predict the genes that participate in a pathway, even when the genes of similar species are compared; this is a consequence of the diversity and complexity of species. It is therefore necessary to use algorithms that are not based on known pathways to identify metabolic pathways effectively and to achieve automatic pathway inference.

The existing pathway prediction methods that do not depend on known pathway information can be roughly divided into two kinds. The first needs information on the chemical functional groups of compounds, which is very difficult and complex to obtain for all the genes in the genome of a species. The second uses experimental methods such as gene knockout or RNA interference to produce microarray data for pathway prediction, which requires a large amount of microarray data and is costly.

To overcome these inadequacies, this paper presents a feasible approach: extract, from each data set, feature attributes that are easy to obtain and that apply to the genes on a pathway, and then use appropriate clustering rules and a suitable clustering algorithm to identify pathway elements, so as to accurately find, within the mass of genetic data,
the corresponding pathway gene cluster.

Genes in the same pathway tend to be co-expressed, and each pathway represents a route that achieves a particular function, so the functions of genes in the same pathway will be similar. The gene cluster corresponding to a pathway is thus a set of genes with similar functions and similar expression patterns. In identifying pathway elements, the microarray expression data and the gene function information can therefore be taken as two features, which are de-noised and formatted to give unified and effective input for clustering.

For the clustering of gene expression profiles, the understanding of the systematic behavior of gene expression is not comprehensive and there is no a priori knowledge of the clusters, so unsupervised learning methods are commonly used. In gene expression data analysis, hierarchical clustering, K-means and the self-organizing map (SOM) neural network are the methods in common use. Because microarray data is one of the features used in this paper, these three clustering algorithms can all be considered for identifying pathway elements. However, hierarchical clustering usually needs pruning to obtain its clustering results, and pruning is often rather subjective, which can lose important information or include irrelevant information. In addition, the results of hierarchical clustering depend on the order of the input vectors, so it is regarded as a locally optimal approach. Hierarchical clustering is therefore not considered further in this paper for identifying pathway elements.
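As a minimal sketch of how the two features described above (microarray expression data and gene function information) might be brought into one unified input for clustering, the following Python code standardizes each expression profile and appends a one-hot encoding of the gene's functional category. The text does not specify the de-noising or formatting steps, so everything here is an assumption made for illustration: the z-score normalization, the one-hot encoding, the `function_weight` parameter, the rule of dropping genes without an annotation, and the gene names and values.

```python
import math

def zscore(profile):
    """Standardize one expression profile to zero mean and unit variance."""
    n = len(profile)
    mean = sum(profile) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if std == 0.0:
        return [0.0] * n  # a flat profile carries no expression pattern
    return [(x - mean) / std for x in profile]

def one_hot(category, categories):
    """Encode a functional category as a 0/1 vector over all categories."""
    return [1.0 if category == c else 0.0 for c in categories]

def build_features(expression, functions, function_weight=1.0):
    """Return {gene: feature vector} for genes present in both inputs.

    function_weight is a hypothetical knob that balances the function part
    of the vector against the expression part.
    """
    categories = sorted(set(functions.values()))
    features = {}
    for gene in sorted(expression):
        if gene not in functions:
            continue  # assumed filtering rule: skip unannotated genes
        func_part = [function_weight * v
                     for v in one_hot(functions[gene], categories)]
        features[gene] = zscore(expression[gene]) + func_part
    return features

# Toy, made-up data (not from the paper).
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [3.0, 1.0, 2.0],
}
functions = {"geneA": "metabolism", "geneB": "metabolism", "geneC": "transport"}
features = build_features(expression, functions)
```

In this toy input, geneA and geneB share both an expression pattern (after standardization) and a functional category, so their feature vectors coincide, which is the property a clustering step would rely on.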
By comparing the experimental results of the K-means clustering algorithm and the SOM algorithm, this paper eventually chooses the SOM algorithm to identify pathway elements.

The work of this paper takes microarray data and gene function as feature attributes and uses SOM, an unsupervised clustering algorithm, to identify pathway elements and obtain the gene cluster corresponding to each specific pathway. The method is applied to the identification of pathway genes in budding yeast (Saccharomyces cerevisiae), and experimental testing and analysis demonstrate its effectiveness. Although the overall accuracy is not ideal, when the algorithm is applied to all the genes in the genome of a species the input data will grow to a certain size, and the accuracy is expected to improve. Moreover, such an algorithm can be used to identify genes in a species for which no pathway information is known, so the accuracy achieved still has good practical significance. On this basis, by further integrating relevant information, template matching, statistical learning and pathway structure map search methods can be used to mine the gene clusters obtained from SOM clustering, and the pathway structure map can finally be obtained. In summary, extracting microarray gene expression data and functional information as two features of the genes of a species and clustering them with the SOM algorithm is feasible and meaningful for the identification of pathway elements.
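To make the clustering step concrete, here is a minimal, self-contained sketch of a self-organizing map in plain Python. It is not the implementation used in this paper. The grid size, the linearly decaying learning rate and neighbourhood radius, the initialization from randomly chosen inputs, and the toy gene vectors are all assumptions chosen for illustration.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(weights, v):
    """Grid position of the node whose weight vector is closest to v."""
    return min(weights, key=lambda node: sq_dist(weights[node], v))

def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    # Assumed initialization: copy a randomly chosen input into each node.
    weights = {(r, c): list(rng.choice(vectors))
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # learning rate decays
        sigma = max(sigma0 * (1.0 - frac), 0.01)  # neighbourhood shrinks
        order = list(vectors)
        rng.shuffle(order)
        for v in order:
            bmu = best_matching_unit(weights, v)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(len(w)):
                    w[i] += lr * h * (v[i] - w[i])  # pull node toward input
    return weights

def assign_clusters(weights, features):
    """Group genes by their best matching unit: {node: [gene, ...]}."""
    clusters = {}
    for gene, v in features.items():
        clusters.setdefault(best_matching_unit(weights, v), []).append(gene)
    return clusters

# Toy, made-up feature vectors forming two well-separated groups.
genes = {
    "g1": [0.0, 0.1, 0.0], "g2": [0.1, 0.0, 0.1], "g3": [0.0, 0.0, 0.1],
    "g4": [5.0, 5.1, 5.0], "g5": [5.1, 5.0, 4.9], "g6": [4.9, 5.0, 5.0],
}
weights = train_som(list(genes.values()), rows=1, cols=2)
clusters = assign_clusters(weights, genes)
```

Each map node plays the role of one candidate gene cluster; genes that share a best matching unit are read as belonging to the same cluster. On real data the grid size would have to be chosen with the expected number of pathways in mind, which this sketch does not attempt.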