| With the rapid development of DNA sequencing technology, whole-genome sequencing and analysis of plants have steadily advanced. Genome projects have been completed for many plants, including the model organism Arabidopsis thaliana and the crops Oryza sativa and Zea mays. However, plants pose particular difficulties: their life cycles are relatively long, most are polyploid, and their genomes are usually large, highly heterozygous, and rich in repeat sequences. These characteristics have become a bottleneck for plant genome sequencing. In the first decade of the twenty-first century, many whole-genome sequencing strategies were developed; among them, BAC-end sequencing and BAC-pooling are relatively new and offer a good solution to these problems. Many plant genome sequencing projects and related analyses have since been completed with these two strategies, for example for Zea mays, barley, and Camellia sinensis. Because the BAC-pooling strategy is relatively new, how to carry out BAC-pooling sequencing more effectively is an important question. This article exposes the practical problems through simulation experiments and, by analysis, offers suggestions for solving them so that the BAC-pooling strategy can be used to better advantage. Cassava is one of the most important tropical plants, and a series of studies on it have been carried out. A cassava genome sequencing project has been established, and genome sequence and protein annotation data have been released in recent years. The work in this article is based on 20,000 cassava BAC clone libraries. In total, we obtain 97 Gb of sequencing data for the cassava BAC genome using second-generation sequencing technology.
By random sampling from the real BAC data, we simulate four parameters of the assembly process (sequencing coverage, read length, paired versus unpaired data, and K-mer) and analyze them systematically. From these simulation experiments and analyses, we draw several conclusions. A sequencing coverage of at least 15-20 times meets the requirements for sequencing, and more data yield better scaffold assemblies. Before the coverage requirement is met, longer reads help sequence assembly and gap filling; even once it is met, longer reads should still be sequenced. The unpaired data produced by preprocessing should be added to the assembly, which both maximizes the use of sequencing resources and makes the results more complete. The choice of K-mer should be based on the sequencing data; the best K-mer improves the N50. In addition, this article assembles the cassava BAC genome and obtains 113,232 sequences with a total length of 153.85 Mb, accounting for 20% of the total genome size; the GC content of the assembly is 36.26%. We also incorporate a long-insert library, sequenced from the whole genome, into the assembly. Comparing the two assemblies, we find that more long sequences are obtained with the long-insert library than without it. The assembly that includes the long-insert library provides more long sequences for genome annotation and makes it easier to improve the completeness of the whole-genome assembly. Based on the BAC genome assembly, we then annotate the genome and obtain 9,054 protein annotations and 1,387 RNA annotations. Of these, 8,119 proteins contain 1,020 known functional domains.
In comparison with the published cassava protein annotations, 57% of the protein annotations in this article are highly homologous to the published results. These matching annotations can serve as a supplement to the protein annotation results that have been published. In this article, an optimal BAC-pooling strategy is established, and a series of work is carried out, including cassava BAC genome assembly, genome annotation, and preliminary functional analysis. |
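
The quantities discussed in the abstract (sequencing coverage, random down-sampling of reads, N50, and GC content) can be illustrated with a minimal Python sketch. The function names and toy inputs below are assumptions made for illustration and are not the pipeline used in this work. N50 is taken here in its common sense: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
import random


def coverage(total_bases, genome_size):
    """Sequencing coverage (depth): sequenced bases divided by target size."""
    return total_bases / genome_size


def subsample_reads(reads, target_coverage, genome_size, seed=0):
    """Randomly draw reads until the requested coverage is reached.

    Hypothetical helper: a simple stand-in for the random sampling step,
    not the authors' implementation.
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    needed = target_coverage * genome_size
    picked, bases = [], 0
    for read in shuffled:
        if bases >= needed:
            break
        picked.append(read)
        bases += len(read)
    return picked


def n50(lengths):
    """Length L such that sequences of length >= L hold half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def gc_content(sequences):
    """Fraction of G and C bases over all sequences (0.0 for empty input)."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return gc / total
```

In a simulation of this kind, `subsample_reads` would be called with several `target_coverage` values, each subset assembled separately, and `n50` compared across the resulting assemblies.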