| Gene,a biological concept,refers to all the nucleotide sequences required to produce a polypeptide chain or functional RNA.Genes support the basic structure and performance of life,and store all the genetic information of race,traits,gestation,growth,and apoptosis of living organisms.Genome refers to the sum of all genetic material in an organism,including a complete set of genes.The proposal and development of genomics has subverted traditional biological research,the number of genome-related projects has increased rapidly,and technological research has advanced by leaps and bounds.Genome matching refers to obtaining a large number of short genome sequences through genome sequencing,and then using algorithms to match the target genome sequence,and finally splicing the target genome.In the process of sequence matching and splicing,the overlapping information between sequences can be used for matching,that is,when the end of a sequence overlaps with the beginning of a sequence for a certain length,the two sequences can be spliced into a long sequence;it can also avoid overlap between sequences,precisely matched to a specific location in the genome.Genome matching can help scientists obtain the genetic material of species,predict the expression of biological traits,ascertain the mechanism of body regulation and synthesize important proteins,etc.Genome matching is one of the most basic and important research directions in the field of genomics,and it is also the only way to develop modern bioinformatics.At present,the short sequences required for genome matching rely on gene sequencing,but because the current sequencing technology is not perfect,there are great limitations,resulting in high complexity,large calculations,and low accuracy in genome matching.In order to study the complexity and algorithms for the genome matching problem,this paper introduces a one-sided Minimum Common String Partitioning problem(MCSP)to simulate genome matching,and defines the Overlapping Block Cover problem(OBC)and the Exact Block Cover problem(EBC).The input of the Overlapping Block Cover problem and the Exact Block Cover problem is the partition P of the string A and the string B,and the output is a permutation Q containing the character blocks in the partition P,so that Q can cover all the characters in the string B.In genome matching,the string A is regarded as the genome to be sequenced,the division P of the string A is regarded as a large number of short sequences generated by sequencing,the string B is regarded as the genome to be matched,and the elements in the division P are called character blocks.If the number of characters that appear in strings is limited to c,which means the character set size is c,and the number of occurrences of each character is not limited,it is called OBCc and EBCc problems;if the number of occurrences of each character in strings is limited to the maximum d times,it is called d-OBC and d-EBC problems.This paper first defines the Overlapping Block Cover problem,studies the complexity of the OBC3 problem,proves that it belongs to the NP-complete problem,and designs the exact algorithm for the 2-OBC problem which time complexity is O(n2)and approximation algorithm for general OBC problems with approximate ratio less than or equal to 2 and time complexity of O(n2m2).Then the Exact Block Cover problem is defined,the complexity of the EBC2 problem is studied,it is proved that it belongs to NP-complete problem,and the exact algorithm of the 2-EBC problem O(n2)time complexity is designed. |