Study On Some Key Issues Of Biological Data Integration

Posted on: 2006-08-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S L Cao
Full Text: PDF
GTID: 1118360155460570
Subject: Computer software and theory
Abstract/Summary:
With the rapid accumulation of sequenced genomes and the fast development and application of biochips, mass spectrometry (MS), combinatorial chemistry, biochemistry, and other high-throughput technologies, biological data has been growing exponentially. This explosion makes it urgent to develop adequate systems for managing and analyzing biological data, so that the immense volume of data can be managed efficiently, interpreted correctly, and used properly. Without the help of such systems, researchers will not find the paths to future breakthroughs and will instead get lost in the data jungle.

The required data are extraordinarily difficult to access because they are distributed, heterogeneous, and otherwise hard to handle. The capability to integrate large quantities of heterogeneous biological resources is therefore decisive for the development of modern biology. A biological integration system that can quickly acquire high-quality biological data from different data sources, and that supports the analysis and mining of useful information from dispersed but interrelated databases, is significant both theoretically and practically.

The dissertation consists of three parts. First, an overview of biological data integration is given. Second, several problems in data integration using the data warehouse approach are investigated: data extraction and transformation, semantic similarity measurement, and semantic search based on Gene Ontology. Finally, a biological data integration system, BioDW, is introduced. The achievements presented in this dissertation are as follows:

(1) A novel method of semi-structured data model representation and data extraction is given.

Most data in biological data sources are semi-structured, so they exist in extremely diverse formats. The relationships among these data are nested, and the local relationships follow no fixed order.
In addition, values are often missing and the structures change constantly. All of this makes biological data extraction a real challenge. The new semi-structured data model representation method is developed to deal with the complexity and uncertainty of the data in biological data sources. This method combines the OEM data model with regular expressions. It can conveniently represent all kinds of data structures, and it can easily bind the target database model to the information to be extracted. Based on this representation, a series of data extraction methods is designed to solve the problems of extracting semi-structured biological data. The extraction algorithms proved highly effective in experiments.

(2) A semantic similarity measurement between Gene Ontology terms is proposed.

As the de facto standard for biological ontologies, Gene Ontology is widely used to annotate genes and gene products in data sources. To implement similarity search at the semantic level, within one data source or across different data sources, it is necessary to measure the similarity among the terms of Gene Ontology itself. Based on the structural characteristics of Gene Ontology, a combined method of measuring similarity among its terms is proposed. First, the information content of each node is calculated according to information content theory. Second, the information content of the node shared by the semantic paths of the two nodes, and the information content of all the nodes on those semantic paths, are computed. Finally, the ratio of these two quantities is used as the similarity between the two nodes.
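The three steps above can be sketched in a few lines of Python. This is only one possible reading of the description, not the dissertation's exact formula: it assumes that information content is IC(t) = -log p(t), that a term's "semantic path" is the term plus all of its ancestors, and that the similarity is the summed IC of the shared nodes divided by the summed IC of all nodes on both paths. The toy ontology, the annotation counts, and all function names are hypothetical.

```python
import math

# Hypothetical toy ontology (child -> parents); these are not real GO terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical number of gene products annotated directly with each term.
DIRECT_COUNTS = {
    "root": 0, "binding": 2, "catalysis": 4, "dna_binding": 3, "rna_binding": 1,
}


def ancestors(term):
    """Return the term together with all of its ancestors (assumed 'semantic path')."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """Step 1: IC(t) = -log p(t), counting annotations of t and its descendants."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = dict.fromkeys(PARENTS, 0)
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    return {term: -math.log(count / total) for term, count in cumulative.items()}


IC = information_content()


def similarity(term_a, term_b):
    """Steps 2 and 3: ratio of shared IC to the IC of all nodes on both paths."""
    path_a, path_b = ancestors(term_a), ancestors(term_b)
    # Sum in a fixed order so identical sets always give identical sums.
    shared = sum(IC[t] for t in sorted(path_a & path_b))
    total = sum(IC[t] for t in sorted(path_a | path_b))
    if total == 0:  # both terms are the root, which carries no information
        return 1.0
    return shared / total


if __name__ == "__main__":
    print(similarity("dna_binding", "rna_binding"))
    print(similarity("dna_binding", "catalysis"))
```

Under these assumptions, two terms that share only the root score zero, because the root has zero information content, while a term compared with itself scores one.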
In a correlation analysis on the same data set that compares the results of this method, the results of several other methods, and human judgments, the proposed method obtains the highest correlation coefficient, up to 0.8638.

(3) Semantic similarity search based on Gene Ontology is implemented.

Based on the proposed semantic similarity measurement, and using the correspondence between Gene Ontology terms and the database entities annotated with those terms, semantic similarity search is implemented both within one data source and across different data sources. The experiments show that comparing semantic similarity from the point of view of molecular function reflects well how closely the functions of gene products are related. To a certain extent, the results also reflect a positive correlation between sequence alignment and semantic similarity search.

(4) A biological data integration system, BioDW, is designed and implemented.

The advantages of BioDW are as follows:
1) A data format transformation tool is created to solve the heterogeneity problem that results from the different structures of the data sources. It transforms data from heterogeneous and autonomous biological data sources into files that can be loaded directly into the data warehouse, which makes it convenient to unify data from the various sources under the relational database model.
2) DBREF is used to provide the LinkDB function. The cross-link relationships between different data sources are stored in a unified DBREF table. DBREF allows a user to quickly find all items related to a user-specified item, no matter which data sources those items reside in.
3) Gene Ontology (GO) is imported as a tool for data clustering. A DB2GO table is generated to relate the items from member data sources to GO. Several ways of semantic search among heterogeneous biological databases are implemented based on GO.
4) Different ways of querying data are provided. Each type of data can be acquired quickly and conveniently by choosing an appropriate query method.
5) A method based on the MD5 algorithm for incremental data updating is implemented. This method makes updating more efficient.
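The abstract does not say how MD5 is applied in item 5). A minimal sketch of one common approach is to store a digest per record and reprocess only the records whose digest differs from the stored one. The record format and the names `record_digest` and `plan_incremental_update` are assumptions made for illustration and are not taken from BioDW.

```python
import hashlib


def record_digest(record_text):
    """Return the MD5 hex digest of one source record (assumed to be plain text)."""
    return hashlib.md5(record_text.encode("utf-8")).hexdigest()


def plan_incremental_update(stored_digests, incoming_records):
    """Compare incoming records with stored digests.

    stored_digests:   record id -> MD5 digest saved at the previous load
    incoming_records: record id -> current record text from the data source
    Returns three lists of record ids: new, changed, removed.
    """
    new, changed = [], []
    for record_id, text in incoming_records.items():
        digest = record_digest(text)
        if record_id not in stored_digests:
            new.append(record_id)
        elif stored_digests[record_id] != digest:
            changed.append(record_id)
    removed = [rid for rid in stored_digests if rid not in incoming_records]
    return new, changed, removed


if __name__ == "__main__":
    # Hypothetical records: P1 is unchanged, P2 was edited, P3 is new.
    previous = {
        "P1": record_digest("ID P1; kinase"),
        "P2": record_digest("ID P2; ligase"),
    }
    current = {
        "P1": "ID P1; kinase",
        "P2": "ID P2; ligase, revised",
        "P3": "ID P3; transporter",
    }
    print(plan_incremental_update(previous, current))
```

Only the new and changed records then need to be transformed and loaded again, which is where the efficiency gain of an incremental update would come from.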
Keywords/Search Tags: Bioinformatics, Integration, Extraction, Update, Ontology, Similarity, Gene Ontology, BioDW