Peculiar Data Mining In Multi Data Sources

Posted on: 2011-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: H Cao
Full Text: PDF
GTID: 2178360305477862
Subject: Computer software and theory
Abstract/Summary:
Data mining extracts valuable knowledge from the data in a database, including classification knowledge, clustering patterns, association rules, and sequential patterns. Association rules, which are obtained by analyzing the links among frequent data in the database, are the classic example of data mining. Note, however, that commonly used data mining methods mine the general rules hidden in the majority of ordinary data; they cannot mine the rules hidden in a small portion of special data, even though such rules are often of significant value. Peculiar rules are of this kind: they reflect relationships among a very small number of objects in the database, yet they express well-known, common-sense facts, and they cannot be discovered by common association rule mining methods.

On the other hand, with the development of database and network technology, people no longer store data in a single database; instead, they store it in several geographically distributed databases, and multi-database mining methods are needed to mine them. Existing multi-database mining methods fall into three categories: (1) Integrate the multiple databases into one and then mine it with a traditional single-database method. Such methods generate a large number of records when the databases are joined and may cause serious problems such as data inconsistency and data conflicts. (2) Mine each local database and then integrate the local patterns to obtain the global model. Such methods may destroy some global patterns. (3) Use inductive logic programming (ILP) to extract related global patterns directly from multiple databases.
Such methods are subject to many constraints in practice, such as a strict input format and very low efficiency.

This thesis studies peculiar data mining over multiple data sources and aims to address the two issues above. The main contributions are as follows:

(1) A new definition of distance (or similarity) between databases is proposed, which can measure distances not only between transaction databases but also between statistical databases. Based on this definition, a clustering-based multi-database classification method called AN-DBC is designed. AN-DBC clusters databases from multiple data sources by similarity: databases with the same or similar structure are assigned to the same cluster, while databases with markedly different structures are assigned to different clusters. Databases in the same cluster are considered to be of the same type and can either be integrated into one database or mined with the same data mining method. Compared with the traditional approach of directly integrating all databases before mining, classifying the databases first can greatly reduce the complexity of the algorithm. In addition, compared with mining local databases and then integrating the local patterns, this method reduces the damage to global patterns to some extent.

(2) The existing peculiar data mining method is analyzed, the disadvantage of its peculiarity threshold setting is pointed out, and an improved threshold setting is proposed. A peculiarity rate factor r is defined: the peculiarity factor of every attribute value is calculated first, and the attribute values whose peculiarity factors rank in the top 100r% are considered peculiar data.
This ensures that some peculiar data can be found for every attribute.

(3) Following the approach used to generate association rules, the probability that peculiar data co-occur is calculated and taken as the relationship between them, and peculiar rules are then extracted.

(4) Eighteen relations (databases) are selected at random from the China Yearbook of all industries, published on the official website of the State Statistics Bureau, and used as experimental data. All databases are first clustered with the AN-DBC method, and the clustering results demonstrate the effectiveness of the method. The improved peculiar data mining method is then used to mine peculiar data from these clusters of databases, and the local peculiar rules are integrated to obtain global peculiar rules. Finally, the experimental results are analyzed and discussed.
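The rate-based threshold in item (2) and the co-occurrence probability in item (3) can be illustrated with a minimal Python sketch. The abstract does not state which peculiarity factor formula the thesis uses, so the sketch assumes one common formulation from the peculiarity-oriented mining literature, PF(x_i) = sum over j of |x_i - x_j|^alpha with alpha = 0.5. The function names, the rounding rule for the top 100r% cut, the conditional-probability reading of "co-occur", and the sample columns are all hypothetical, not taken from the thesis.

```python
from itertools import combinations


def peculiarity_factors(values, alpha=0.5):
    """Assumed formulation: PF(x_i) = sum_j |x_i - x_j| ** alpha."""
    return [sum(abs(x - y) ** alpha for y in values) for x in values]


def peculiar_indices(values, r, alpha=0.5):
    """Row indices whose peculiarity factor ranks in the top 100r%.

    The cut size is r * n rounded down, with at least one value kept
    (an assumption; the abstract does not specify the rounding).
    """
    pf = peculiarity_factors(values, alpha)
    k = max(1, int(r * len(values)))
    ranked = sorted(range(len(values)), key=lambda i: pf[i], reverse=True)
    return set(ranked[:k])


def co_occurrence(table, r, alpha=0.5):
    """For each attribute pair (a, b), the fraction of rows peculiar
    in a that are also peculiar in b."""
    peculiar = {name: peculiar_indices(col, r, alpha)
                for name, col in table.items()}
    rules = {}
    for a, b in combinations(table, 2):
        both = peculiar[a] & peculiar[b]
        rules[(a, b)] = len(both) / len(peculiar[a])
    return rules


# Hypothetical statistical table: one list per attribute, one entry per row.
table = {
    "output": [10, 11, 12, 11, 95],
    "employees": [5, 6, 5, 6, 40],
    "area": [3, 50, 4, 3, 4],
}

for pair, probability in co_occurrence(table, r=0.2).items():
    print(pair, probability)
```

Because the cut is a rank rather than an absolute threshold, every attribute yields at least one peculiar value, which matches the motivation given in item (2). Attribute pairs with a high co-occurrence value would be the candidates for peculiar rules.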
Keywords/Search Tags: peculiar data mining, peculiar rule, multi-database mining, database classification