As sequencing technologies mature, massive amounts of protein sequence data have been obtained, but the corresponding structural and functional information is still difficult to obtain. Predicting the homology relationships of protein sequences with unknown structures by searching for homologous protein sequences in protein databases with known structures is an important part of predicting protein structure and function. However, in protein remote homology detection, it is difficult for existing methods to obtain accurate prediction results because the sequence similarity of remote homologous proteins is generally less than 30%. Improving the prediction performance of sequence search methods for protein remote homology detection is therefore a challenging task. This dissertation studies two problems: non-homologous noise and the weak ability to detect remote homologous proteins. Based on the fusion of diverse sequence similarity features, several protein sequence search methods are proposed to improve the performance of protein remote homology detection. The main research contents of this dissertation cover the following four aspects.

Proteins are the specific executors of life activities, and analyzing the spatial structure of a protein is an important way to understand its function, but the analysis of protein structure still relies heavily on expensive infrastructure such as cryo-electron microscopy. Meanwhile, increasingly mature sequencing technologies have produced massive amounts of protein sequence data. How to predict protein structure and function from sequence information is a major biological challenge. Protein sequence search methods have become an important means of protein structure and function prediction by searching for homologous information in protein databases with known structures. However, these search methods still have serious shortcomings in protein
remote homology detection with low sequence similarity.

To address the problem that sequence profiles containing non-homologous noise reduce the performance of homology search methods, a supervised-manner-based iterative search framework (SMI-search) is proposed. First, three types of incorrectly selected homology errors are defined based on an analysis of non-homologous noise. Second, an incorrect-selection correction module and a search result optimization module are proposed based on learning to rank and feature similarities. Finally, these two modules are embedded into the iterative alignment framework. On the benchmark dataset, SMI-search significantly improves the performance of basic methods for protein remote homology detection. This indicates that SMI-search addresses the impact of non-homologous noise and shows good generality.

To address the performance bottleneck of sequence search methods when searching for remote homologous proteins, a profile-link-based search method (PL-search) is proposed. First, based on the asymmetry of iterative alignment methods, a double-link strategy is proposed to solve the problem of non-homologous noise. Second, an iterative extending-link strategy is proposed to enhance the ability to capture remote homologous protein sequences. Finally, a two-level Jaccard distance is proposed based on sequence profile alignment. On the benchmark dataset, when applied to basic search methods, PL-search not only clearly improves the ability to detect remote homologous proteins but also improves the ranking quality of the detected results. This indicates that the profile-link-based search method can effectively replace sequence alignment methods when searching for more remote homologous protein sequences with lower sequence similarity.

To address the difficulty that protein sequence search methods have in balancing the number of detected homologous proteins against ranking quality, a two-layer search
framework (S2L-search) for protein remote homology detection is proposed to further improve ranking quality. In this study, a filtering strategy and a re-ranking strategy are used to address the problems of non-homologous proteins and ranking quality. In the filtering strategy, the ability of SMI-search to correct incorrect selections and the ability of PL-search to remove non-homologous protein sequences are fused. In the re-ranking strategy, a learning-to-rank model is constructed based on the complementarity of the sequence similarity features in SMI-search and PL-search, thereby improving the ranking quality of the detected results. The experimental results show that the S2L-search framework improves both the ability to detect remote homologous proteins and the ranking quality. This indicates that the S2L-search framework effectively integrates the complementarity between different similarity features to improve the ability to discriminate remote homologous proteins.

To address the problem that previous studies ignore the diverse hierarchical structural relationships between the query protein and candidate proteins, a search method based on predicted protein hierarchical relationships (PHR-search) is proposed. In the PHR-search framework, superfamily-level prediction information is obtained by extracting the local and global features of the Hidden Markov Model (HMM) profile with a convolutional neural network, and it is converted to fold-level and class-level prediction information according to the predicted hierarchical relationships of SCOPe. Similarity features are then calculated from the predicted information using various similarity calculation methods. Based on these similarity features, a filtering strategy and a re-ranking strategy are used to construct the two-level search of PHR-search. The experimental results on the SCOPe benchmark show that PHR-search enhances the ability to distinguish
non-homologous proteins in the detected results, which improves the ranking quality. Furthermore, PHR-search shows strong generality: it is successfully applied to different basic search methods.
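
As an illustration of how superfamily-level predictions might be converted to the fold and class levels, the following minimal sketch assumes that predictions are probabilities keyed by SCOPe sccs-style superfamily identifiers of the form `class.fold.superfamily` (for example `a.1.1`), and that a parent's probability is the sum over its children. The function name `roll_up`, the summation rule, and the example values are assumptions made for illustration only; the abstract does not specify the conversion actually used in PHR-search.

```python
from collections import defaultdict


def roll_up(superfamily_probs):
    """Aggregate superfamily-level probabilities to fold and class levels.

    Keys are SCOPe sccs-style superfamily identifiers such as "a.1.1"
    (class "a", fold "a.1"). Summing child probabilities is an assumption
    of this sketch, not the method described in the dissertation.
    """
    fold_probs = defaultdict(float)
    class_probs = defaultdict(float)
    for sccs, prob in superfamily_probs.items():
        cls, fold, _superfamily = sccs.split(".")
        fold_probs[f"{cls}.{fold}"] += prob
        class_probs[cls] += prob
    return dict(fold_probs), dict(class_probs)


# Hypothetical predicted distribution over three superfamilies.
predicted = {"a.1.1": 0.5, "a.1.2": 0.25, "b.1.1": 0.25}
fold_probs, class_probs = roll_up(predicted)
print(fold_probs)   # {'a.1': 0.75, 'b.1': 0.25}
print(class_probs)  # {'a': 0.75, 'b': 0.25}
```

Under these assumptions, a query and a candidate whose predicted superfamilies differ can still receive similar fold-level or class-level distributions, which is the kind of hierarchical signal a similarity feature could then compare.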