Mining Shared Knowledge\Patterns Between Two Datasets

Posted on: 2012-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: Mamadou Lamarana BAH
GTID: 2268330425484169
Subject: Computer Applications
Abstract/Summary:
How to characterize the differences between two datasets is an important problem in data mining. Decision trees are widely used for this problem: they find interesting patterns that describe significant changes between a pair of datasets. A decision tree is a data mining model that classifies data into a given set of categories. However, existing decision tree methods have not exploited classification similarities between two distinct datasets in order to mine patterns shared by both.

In this thesis, we study the mining of shared patterns between two datasets. We investigate a special type of shared pattern: a tree that can be mined from two distinct datasets and over which their data distributions are similar. We combine the concepts of shared patterns, classification, and the binary tree data structure in one robust shared binary tree mining algorithm, named MiningSBT, for mining a high-quality shared binary tree between two datasets.

To efficiently mine a shared binary tree between two datasets (D1, D2), we suppose that D1 is a well-understood dataset, denoted UD1, and that D2 is a not very well understood dataset, denoted ND2.
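
To make the idea of a tree shared by two datasets concrete, the following is a minimal sketch of how a single candidate split might be scored against both datasets. It is not the MiningSBT algorithm, whose details this abstract does not give. The function names, the similarity measure (one minus the difference in the fraction of rows sent to the left branch), and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def split_stats(rows, attr, threshold):
    """Return (fraction of rows sent left, accuracy of majority-class labelling).

    Each row is a (values_dict, label) pair. Hypothetical helper.
    """
    left = [label for values, label in rows if values[attr] <= threshold]
    right = [label for values, label in rows if values[attr] > threshold]
    # Each non-empty side predicts its majority class.
    correct = sum(max(Counter(side).values()) for side in (left, right) if side)
    return len(left) / len(rows), correct / len(rows)


def shared_split_score(d1, d2, attr, threshold):
    """Score one split on two datasets (assumed measures, not the thesis's).

    similarity: 1 - |left fraction in d1 - left fraction in d2|
    accuracy:   the lower of the two per-dataset accuracies
    """
    frac1, acc1 = split_stats(d1, attr, threshold)
    frac2, acc2 = split_stats(d2, attr, threshold)
    return 1.0 - abs(frac1 - frac2), min(acc1, acc2)


# Toy data: ud1 plays the role of UD1, nd2 the role of ND2.
ud1 = [({"x": 1.0}, "a"), ({"x": 2.0}, "a"), ({"x": 3.0}, "b"), ({"x": 4.0}, "b")]
nd2 = [({"x": 1.5}, "a"), ({"x": 2.5}, "b"), ({"x": 3.5}, "b"), ({"x": 4.5}, "b")]

print(shared_split_score(ud1, nd2, "x", 2.0))  # (0.75, 1.0)
```

A tree builder in this spirit would prefer splits that score well on both numbers at once, rather than on accuracy in one dataset alone.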
Using UD1 can help to understand ND2 through the proposed mining algorithm, because the resulting shared binary tree should satisfy two properties: high similarity of data distribution and high classification accuracy in both datasets.

Our shared binary tree is built in much the same way as a traditional decision tree, but the mining algorithm differs from traditional decision tree algorithms because of the challenges posed by the presence of two datasets, the data distribution similarity requirement, and the classification accuracy requirement.

Experimental results on real-world and synthetic datasets show that the algorithm is effective.

One important purpose of conducting gene expression experiments is to understand the correlation of gene expression profiles to disease states. Based on the notion of emerging patterns and an entropy-oriented discretization method, we discover groups of genes that are significantly correlated to disease states. In each group, the member genes, each constrained to a specific expression interval, occur together in only one type of cell with a maximally large frequency, and never occur together in the other types of cells. According to a study of a colon tumor dataset, such gene groups (also called patterns) can reach a frequency of 90%, providing good insight into the correlation of gene expression profiles to disease states. The patterns can be used to correctly predict whether a new cell is normal or cancerous.

Overview of Databases and Data Mining

Roughly speaking, a database is an organized collection of data for one or more purposes, usually in digital form.
The data are typically organized to model relevant aspects of reality (for example, the availability of students' grades in a school), in a way that supports processes requiring this information (for example, finding a grade for a given student).

The term database may refer to particular aspects of an organized collection of data, such as the logical database, the physical database as data content in computer data storage, or many other database sub-definitions.

The term database is correctly applied to the data and their supporting data structures, and not to the database management system (DBMS). The database data collection together with the DBMS is called a database system. The term database system implies that the data is managed to some level of quality (measured in terms of accuracy, availability, usability, and resilience), and this in turn often implies the use of a general-purpose DBMS. A general-purpose DBMS is typically a complex software system that meets many usage requirements, and the databases that it maintains are often large and complex. Databases are now so widely used that virtually every technology and product relies on databases and DBMSs for its development and commercialization, or may even have one embedded in it. Organizations and companies, from small to large, also depend heavily on databases for their operations.

Well-known DBMSs include Oracle, IBM DB2, Microsoft SQL Server, PostgreSQL, and MySQL. A database is not generally portable across different DBMSs, but different DBMSs can interoperate to some degree by using standards like SQL and ODBC to support a single application together.
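
As a small illustration of the student-grades example above, the sketch below stores grades in a database and looks one up with SQL. It uses SQLite through Python's standard `sqlite3` module purely for convenience. The table layout, column names, helper functions, and sample rows are assumptions made for this example, not part of the original text.

```python
import sqlite3


def build_grades_db():
    """Create an in-memory database holding a few hypothetical grades."""
    conn = sqlite3.connect(":memory:")
    # Data definition: describe the structure of the stored data.
    conn.execute(
        "CREATE TABLE grades ("
        "student TEXT NOT NULL, course TEXT NOT NULL, grade REAL NOT NULL, "
        "PRIMARY KEY (student, course))"
    )
    # Data manipulation: insert rows, using placeholders for the values.
    conn.executemany(
        "INSERT INTO grades (student, course, grade) VALUES (?, ?, ?)",
        [
            ("Alice", "Databases", 88.5),
            ("Alice", "Data Mining", 92.0),
            ("Bob", "Databases", 74.0),
        ],
    )
    conn.commit()
    return conn


def find_grade(conn, student, course):
    """Query: return the grade for one student in one course, or None."""
    row = conn.execute(
        "SELECT grade FROM grades WHERE student = ? AND course = ?",
        (student, course),
    ).fetchone()
    return None if row is None else row[0]


conn = build_grades_db()
print(find_grade(conn, "Alice", "Databases"))  # 88.5
```

Because the statements are plain SQL, much the same schema and query could be issued against other relational DBMSs, subject to dialect differences.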
A DBMS also needs to provide effective run-time execution to properly support (e.g., in terms of performance, availability, and security) as many end-users as needed.

The design, construction, and maintenance of a complex database require specialist skills: the staff performing these functions are referred to as database application programmers and database administrators. Their tasks are supported by tools provided either as part of the DBMS or as stand-alone software products. These tools include specialized database languages, including data definition languages (DDL), data manipulation languages (DML), and query languages. These can be seen as special-purpose programming languages, tailored specifically to manipulate databases; sometimes they are provided as extensions of existing programming languages, with added database commands. Database languages are generally specific to one data model, and in many cases they are specific to one DBMS type. The most widely supported database language is SQL, which was developed for the relational data model and combines the roles of DDL, DML, and a query language.

One way to classify databases is by the type of their contents, for example: bibliographic, document-text, statistical, or multimedia objects. Another way is by their application area, for example: accounting, music compositions, movies, banking, manufacturing, or insurance.

Data Mining

Data mining (the analysis step of the knowledge discovery in databases process), a relatively young and interdisciplinary field of computer science, is the process of discovering new patterns from large datasets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
The goal of data mining is to extract knowledge from a dataset in a human-understandable structure; it involves database and data management, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structure, visualization, and online updating. The term is a buzzword, and is frequently misused to mean any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics); it is also generalized to any kind of computer decision support system, including artificial intelligence, machine learning, and business intelligence. In the proper use of the word, the key term is discovery, commonly defined as "detecting something new". The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns, such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indexes. These patterns can then be seen as a kind of summary of the input data and used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which a decision support system can then use to obtain more accurate prediction results.

General-purpose DBMS

A DBMS has evolved into a complex software system, and its development typically requires thousands of person-years of effort. Some general-purpose DBMSs, like Oracle, Microsoft SQL Server, and IBM DB2, have been undergoing upgrades for thirty years or more. General-purpose DBMSs aim to satisfy as many applications as possible, which typically makes them even more complex than special-purpose databases.
However, the fact that they can be used "off the shelf", as well as their amortized cost over many applications and instances, makes them an attractive alternative whenever they meet an application’s requirements.

Types of people involved

Three types of people are involved with a general-purpose DBMS:

1. DBMS developers: These are the people who design and build the DBMS product, and the only ones who touch its code. They are typically the employees of a DBMS vendor (e.g., Oracle, IBM, Microsoft, Sybase) or, in the case of open source DBMSs (e.g., MySQL), volunteers or people supported by interested companies and organizations. They are typically skilled systems programmers. DBMS development is a complicated task, and some of the popular DBMSs have been under development and enhancement (also to follow progress in technology) for decades.

2. Application developers and database administrators: These are the people who design and build a database-based application that uses the DBMS. The database administrators design the needed database and maintain it. The application developers write the application programs that make up the application. Both are very familiar with the DBMS product and use its user interfaces (and usually other tools as well) for their work. Sometimes the application itself is packaged and sold as a separate product, which may include the DBMS inside (see Embedded database; subject to proper DBMS licensing), or is sold separately as an add-on to the DBMS.

3. Application end-users (e.g., accountants, insurance people, medical doctors): These people know the application and its end-user interfaces, but need not know or understand the underlying DBMS. Thus, though they are the intended and main beneficiaries of a DBMS, they are only indirectly involved with it.

Database research

Database research has been an active and diverse area, with many specializations, carried out since the early days of the database concept in the 1960s.
It has strong ties with database technology and DBMS products. Database research has taken place at research and development groups of companies (notably IBM Research, which contributed technologies and ideas to virtually every DBMS existing today), research institutes, and academia. Research has been done through both theory and prototypes. The interaction between research and database-related product development has been very productive for the database area, and many related key concepts and technologies emerged from it. Notable are the relational and entity-relationship models, the atomic transaction concept and related concurrency control techniques, query languages and query optimization methods, and more. Research has provided deep insight into virtually all aspects of databases, though it has not always been pragmatic or effective (and cannot and should not always be: research is exploratory in nature and does not always lead to accepted or useful ideas). Ultimately, market forces and real needs determine the selection of problem solutions and related technologies, including among those proposed by research. However, occasionally the best and most elegant solution does not win (e.g., SQL). Throughout their history, DBMSs and their databases have to a great extent been the outcome of such research, while real product requirements and challenges have triggered database research directions and sub-areas.

The database research area has several notable dedicated academic journals (e.g., ACM Transactions on Database Systems (TODS), Data and Knowledge Engineering (DKE), and more) and annual conferences (e.g., ACM SIGMOD, ACM PODS, VLDB, IEEE ICDE, and more), as well as an active and quite heterogeneous (subject-wise) research community all over the world.
Keywords/Search Tags: databases, data mining, emerging patterns, decision tree, binary tree, shared model\patterns, jumping emerging patterns, gene expression data, classification