
Research on Haplotype Assembly Based on Semi-supervised Learning

Posted on: 2021-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Li
GTID: 2428330611460714
Subject: Software engineering
Abstract/Summary:
Haplotype assembly aims to reconstruct an individual's haplotypes from aligned DNA fragments, and many algorithms and models have been proposed for this problem. Since the completion of the Human Genome Project (HGP), it has been recognized that differences in genetic sequence among individuals are a main cause of phenotypic differences such as hair color, body shape, and susceptibility to disease. If complete and error-free DNA sequences could be obtained directly by biological methods, haplotype assembly would be easy to solve. In practice, however, the limits of current biotechnology mean that only short DNA fragments can be obtained, and these inevitably contain errors. Assembling an individual's haplotypes with as few errors as possible, or even exactly, from fragments that contain sequencing errors is therefore both a goal and a challenge.

Semi-supervised learning is a machine learning strategy that makes systematic use of unlabeled data. Because labeled data are hard to obtain, it combines a small amount of labeled data with a large amount of unlabeled data to improve the performance of the learned model. On this basis, this thesis proposes two haplotype assembly methods based on a semi-supervised strategy.

K-Means is a classical clustering algorithm in machine learning. It repeatedly partitions the data and updates the cluster centers, iterating toward an optimal clustering. This thesis proposes SKMEANS, a haplotype assembly method based on semi-supervised learning and an improved K-Means model. SKMEANS uses the fragments that are classified in the preprocessing phase to construct the initial cluster centers. It then repeatedly partitions the fragments by comparing the distance between each fragment and each cluster center, and updates the centers after each partition. This phase is repeated until the centers no longer change. The two final centers are then taken as the optimal haplotypes, from which the haplotypes are constructed.

K-Nearest Neighbors (KNN) is one of the most basic classification algorithms in machine learning. Its principle is that a sample in the feature space is assigned the class to which most of its K nearest neighbors belong. This thesis proposes SKNN, a haplotype assembly method based on semi-supervised learning and an improved KNN model. The preprocessing phase of SKNN is similar to that of SKMEANS: the classes of some fragments are inferred from their characteristics, and these fragments are used to construct the initial model. The initial model then classifies the remaining fragments, and any fragment classified with high confidence is added to the training set to optimize the model. After the model has been optimized, the fragments with low confidence are reclassified. Once all fragments have been classified into two disjoint sets, each haplotype is deduced from the overlapping sites within each set.

In the experiments, SKMEANS and SKNN are tested on both simulated and real datasets and compared with two popular algorithms, ProbHap and PEATH. The results show that SKMEANS and SKNN are feasible and effective, that they solve the problem more accurately than the other two algorithms, and that on real datasets SKMEANS has a shorter running time.
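The seeded clustering loop described for SKMEANS can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not give the fragment encoding, the distance measure, or the center update rule, so all of the following are assumptions made here for illustration: fragments are equal-length strings over `0`, `1` and `-` (no call), distance is the number of mismatches at sites covered by both strings, and a center is the per-site majority vote of its cluster. The names `distance`, `consensus` and `seeded_two_means` are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed symbol for a site the fragment does not cover


def distance(fragment, center):
    """Count sites where both strings have a call and the calls disagree."""
    return sum(1 for f, c in zip(fragment, center)
               if f != GAP and c != GAP and f != c)


def consensus(fragments, length):
    """Per-site majority vote; GAP where no fragment covers the site."""
    center = []
    for site in range(length):
        counts = Counter(f[site] for f in fragments if f[site] != GAP)
        if not counts:
            center.append(GAP)
        else:
            # Ties are broken toward '0' (an arbitrary choice for this sketch).
            center.append("1" if counts["1"] > counts["0"] else "0")
    return "".join(center)


def seeded_two_means(fragments, seed_labels, max_iter=100):
    """Two-cluster K-Means whose initial centers come from pre-labeled fragments.

    seed_labels maps a fragment index to cluster 0 or 1.
    """
    length = len(fragments[0])
    centers = [
        consensus([fragments[i] for i, lab in seed_labels.items() if lab == k], length)
        for k in (0, 1)
    ]
    labels = []
    for _ in range(max_iter):
        # Partition step: each fragment joins the nearer center.
        labels = [0 if distance(f, centers[0]) <= distance(f, centers[1]) else 1
                  for f in fragments]
        # Update step: recompute each center from its current members.
        new_centers = [
            consensus([f for f, lab in zip(fragments, labels) if lab == k], length)
            for k in (0, 1)
        ]
        if new_centers == centers:  # stop once the centers no longer change
            break
        centers = new_centers
    return centers, labels


# Toy, error-free fragments drawn from the haplotypes 01010 and 10101.
fragments = ["010--", "-101-", "--010", "101--", "-010-", "--101"]
centers, labels = seeded_two_means(fragments, seed_labels={0: 0, 3: 1})
print(centers, labels)
```

On this toy input the two centers converge to the complementary strings `01010` and `10101`. Real data would also need a policy for sequencing errors and for fragments that are equidistant from both centers, which the sketch resolves arbitrarily in favor of cluster 0.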
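The self-training loop described for SKNN can be sketched in the same spirit. Again, this is an illustration under assumptions, not the method as implemented in the thesis: the abstract does not define the fragment distance, the confidence score, or the threshold. Here the distance is the mismatch rate over sites shared by two fragments, the confidence is the winning class's share of similarity-weighted votes among the K nearest labeled fragments, and `k`, `threshold` and all function names are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed symbol for a site the fragment does not cover


def frag_distance(a, b):
    """Mismatch rate over sites covered by both fragments; None if none shared."""
    shared = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def knn_vote(fragment, training, k):
    """Similarity-weighted vote of the k nearest labeled fragments.

    Returns (label, confidence), or (None, 0.0) when no usable neighbor exists.
    """
    scored = []
    for other, label in training:
        d = frag_distance(fragment, other)
        if d is not None:
            scored.append((d, label))
    scored.sort(key=lambda pair: pair[0])
    weights = Counter()
    for d, label in scored[:k]:
        weights[label] += 1.0 - d  # closer neighbors carry more weight
    total = sum(weights.values())
    if total == 0:
        return None, 0.0
    label, weight = max(weights.items(), key=lambda item: item[1])
    return label, weight / total


def self_training_knn(fragments, seed_labels, k=3, threshold=0.8):
    """Grow the training set with confident predictions, then label the rest."""
    labels = dict(seed_labels)
    pending = [i for i in range(len(fragments)) if i not in labels]
    while pending:
        training = [(fragments[i], lab) for i, lab in labels.items()]
        results = {i: knn_vote(fragments[i], training, k) for i in pending}
        confident = [i for i in pending
                     if results[i][0] is not None and results[i][1] >= threshold]
        if not confident:
            # No confident prediction is left: accept the current model's best
            # guess and leave fragments without any usable neighbor unlabeled.
            for i in pending:
                if results[i][0] is not None:
                    labels[i] = results[i][0]
            break
        for i in confident:
            labels[i] = results[i][0]
        pending = [i for i in pending if i not in labels]
    return labels


def consensus(fragments, length):
    """Per-site majority vote within one set; GAP where the set has no coverage."""
    out = []
    for site in range(length):
        counts = Counter(f[site] for f in fragments if f[site] != GAP)
        if not counts:
            out.append(GAP)
        else:
            out.append("1" if counts["1"] > counts["0"] else "0")
    return "".join(out)


# The last fragment shares no site with either seed, so it is only
# classified after the first round has enlarged the training set.
fragments = ["010--", "-101-", "--010", "101--", "-010-", "--101", "---10"]
labels = self_training_knn(fragments, seed_labels={0: 0, 3: 1})
haplotypes = [consensus([fragments[i] for i, lab in labels.items() if lab == c], 5)
              for c in (0, 1)]
print(labels, haplotypes)
```

The final consensus step stands in for "deducing each haplotype by the overlapping sites in each set". A per-site majority vote is one simple way to do that and is an assumption of this sketch.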
Keywords/Search Tags: single nucleotide polymorphism, haplotype assembly, semi-supervised learning, k-nearest neighbors, k-means