| Predicting and identifying disease genes is of great significance for the treatment of genetic disorders. In recent years, the development of high-throughput sequencing technologies has brought new opportunities to gene research, and many methods for predicting disease-causing genes have emerged. These methods start from known causal relationships between diseases or phenotypes and genes and use network-based approaches to rank candidate genes, with the goal of identifying disease-causing genes. They rest on the premise that genes causing the same or similar diseases lie close to one another, or interact with one another, in biological networks; in other words, such genes exhibit modularity. However, the adjacency matrix that existing methods build for a network of biological entities is coarse: an entry is 1 if two genes are related and 0 otherwise, so the relationship between genes is not reasonably quantified. In addition, high-throughput sequencing has produced a large amount of biological data, making integrated data analysis the main means of disease gene prediction and identification. Yet most of these methods construct features from the local information of a biological entity and do not make fuller use of the topology of the entity network. The work of this paper is as follows. First, we study the relationships between different biological entities from a statistical point of view, analyzing the distribution of biological data to quantify the relative importance of biological entities. To this end, we introduce two statistical features that quantify the relationship between genes. 
The first is based on the correlation coefficient of gene expression data and captures the importance of a gene's function or regulation to the entire gene network. The second is based on the divergence of gene expression feature vectors: gene expression values are treated as probabilities that quantify a gene's expression status, which gives a measure of the relative importance between genes. We compare these two statistical features with protein interaction network data. Experiments show that, in predicting potential disease genes, both features achieve better AUC (area under the curve), top-1, and top-50 results than the protein interaction network data, which verifies that they are valid for prioritizing pathogenicity-related genes. Second, we present a random walk algorithm with a binary regression model to predict pathogenicity-related genes. A random walk model ranks the genes associated with each gene, and the top k genes are selected to construct that gene's feature vectors F1, F2, and F3. These top k genes are the ones most strongly associated with the gene from a global perspective; their global information is then collected either as the weights of the genes labeled 1 and labeled 0 or as the numbers of genes labeled 1 and labeled 0, yielding the gene's feature vector. Third, with feature F1, the AUC of our method on three types of biological networks (the protein interaction network, the gene co-expression network, and the gene pathway network) is significantly better than that of the alternative feature "PCF1" and of the MRF and RWR algorithms. With feature F2, on the same three networks, the AUC of our method is higher than that of the alternative feature "PCF2". 
With feature F3, which integrates the three networks, the AUC of the proposed method significantly exceeds that of the MRF, RWR, and DIR algorithms and of the alternative feature "PCF3". In addition, the algorithms are compared in terms of time efficiency, and the comparison shows that the proposed algorithm is competitive. |
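
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give formulas, so the absolute Pearson correlation used as an edge weight, the restart probability, the four-element feature layout, and all function names here are assumptions made for illustration. The binary regression step that would consume these features is omitted.

```python
import numpy as np


def correlation_weights(expr):
    """Weighted adjacency from a genes x samples expression matrix.

    Assumption: edge weight = |Pearson r| between two genes' expression
    profiles, replacing the 0/1 adjacency criticized in the abstract.
    """
    w = np.abs(np.corrcoef(expr))
    np.fill_diagonal(w, 0.0)  # no self-loops
    return w


def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk restarting at `seed`."""
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # avoid division by zero for isolated nodes
    w = adj / col_sums                 # column-normalized transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[seed] = 1.0
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (w @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


def top_k_label_features(adj, gene, labels, k=3, restart=0.3):
    """Feature vector for `gene` built from its top-k random-walk neighbours.

    Hypothetical layout: [weight of label-1 genes, weight of label-0 genes,
    count of label-1 genes, count of label-0 genes].
    """
    scores = random_walk_with_restart(adj, gene, restart)
    scores[gene] = -np.inf             # exclude the query gene itself
    k = min(k, adj.shape[0] - 1)
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    top_scores, top_labels = scores[top], labels[top]
    return np.array([
        top_scores[top_labels == 1].sum(),
        top_scores[top_labels == 0].sum(),
        (top_labels == 1).sum(),
        (top_labels == 0).sum(),
    ])
```

Calling `top_k_label_features(correlation_weights(expr), gene, labels)` for every gene would give one row per gene, which could then be passed to any binary classifier such as logistic regression.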