Data Preprocessing And Pattern Mining In Multiple Data Sources

Posted on: 2015-03-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Lin
GTID: 1268330428974533
Subject: Computer application technology
Abstract/Summary:
With the rapid development of database, network, and other information technologies, large and heterogeneous multiple data sources have become ubiquitous in practical applications such as sensor networking, supermarket transactions, and social media analysis. These databases contain a great deal of useful information and valuable knowledge, but they are also heterogeneous, autonomous, complex, and inconsistent, which makes them challenging for traditional mining algorithms. Knowledge discovery from multiple data sources, including label propagation, source quality evaluation, and pattern mining, is therefore a significant problem with practical value. The main contributions of this dissertation are as follows.

1) It is difficult to merge multiple data sources into a centralized database for learning because of the inconsistency between different data sources. We present two label propagation methods, referred to as global consensus and local consensus, that infer the labels of training objects from unlabeled sources by making full use of class label information from labeled sources and internal structure information from unlabeled sources. We test the classification accuracy, robustness, and scalability of the proposed methods by constructing a multiple-data-source ensemble learning model. Experimental results show that local consensus outperforms global consensus when there are many unlabeled sources.

2) Some sources may be irrelevant or redundant in multiple-data-source learning, so it is worthwhile to select a set of good information sources that can help improve learning performance. We present an algorithm for source assessment and selection based on max-significance-min-redundancy, in which significance represents the degree to which an information source contributes to classification, and redundancy represents the information overlap among different information sources. We then select the top p percent of sources to construct a multiple-data-source ensemble learning model. Experimental results show that the metric can effectively select sources related to the target mining task.

3) Every time a customer interacts with a business, there is an opportunity to gain strategic knowledge. Transactional data collected over time contain a wealth of information about customers and their purchasing patterns. We divide transactional data into multiple time-stamped databases according to their sale periods, and present an efficient algorithm for mining four types of patterns, represented by stable patterns. First, we define the notion of stable items according to two constraint conditions: minsupp and varivalue. We then measure the similarity between stable items based on gray relational analysis, and propose a hierarchical gray clustering method for mining stable patterns consisting of stable items. Experimental results show that the proposed algorithm is effective, efficient, and scalable.
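The max-significance-min-redundancy idea in contribution 2 can be illustrated with a small greedy sketch. The abstract does not say how significance or redundancy are computed, nor how the two are combined, so everything here is an assumption: the function name `select_sources`, the precomputed `significance` list, the pairwise `redundancy` matrix, and the score "significance minus mean redundancy with the sources already chosen" are all hypothetical choices made for illustration.

```python
import math


def select_sources(significance, redundancy, p):
    """Greedily pick the top p percent of sources.

    significance[i]  : assumed score of how much source i helps classification
    redundancy[i][j] : assumed overlap score between sources i and j
    p                : percentage of sources to keep (0 < p <= 100)

    The scoring rule below is an illustrative assumption, not the
    dissertation's actual metric.
    """
    n = len(significance)
    k = max(1, math.ceil(n * p / 100))  # number of sources to keep
    selected = []
    remaining = set(range(n))

    def score(i):
        # First pick: significance only. Later picks: penalize overlap
        # with the sources that are already selected.
        if not selected:
            return significance[i]
        mean_red = sum(redundancy[i][j] for j in selected) / len(selected)
        return significance[i] - mean_red

    while len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    sig = [0.9, 0.85, 0.4, 0.3]
    red = [
        [0.0, 0.8, 0.1, 0.1],
        [0.8, 0.0, 0.1, 0.1],
        [0.1, 0.1, 0.0, 0.1],
        [0.1, 0.1, 0.1, 0.0],
    ]
    print(select_sources(sig, red, 50))  # -> [0, 2]
```

In this toy input, source 1 is nearly as significant as source 0 but overlaps heavily with it, so the second pick goes to source 2 instead.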
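The stable-item step in contribution 3 can be sketched in the same spirit. The abstract names the two constraints, minsupp and varivalue, but does not define them, so this sketch assumes that an item is stable when its mean support across the time-stamped databases is at least `minsupp` and the spread of its supports (maximum minus minimum) is at most `varivalue`. The similarity function uses the standard gray relational coefficient with distinguishing coefficient `rho = 0.5`, computed pairwise as a simplification. All function names are hypothetical, and the hierarchical gray clustering step is not shown.

```python
def support_series(databases, item):
    """Support of `item` in each time-stamped database.

    `databases` is a list of non-empty databases, each a list of
    transactions (sets of items).
    """
    return [sum(1 for t in db if item in t) / len(db) for db in databases]


def is_stable(supports, minsupp, varivalue):
    """Assumed stability test: frequent on average and low variation.

    This definition is an illustrative guess; the source only names
    the two thresholds.
    """
    mean = sum(supports) / len(supports)
    spread = max(supports) - min(supports)
    return mean >= minsupp and spread <= varivalue


def gray_relational_grade(reference, other, rho=0.5):
    """Gray relational grade between two equal-length support series.

    Simplification: the minimum and maximum differences are taken over
    this pair only, rather than over all comparison series.
    """
    deltas = [abs(a - b) for a, b in zip(reference, other)]
    dmin, dmax = min(deltas), max(deltas)
    if dmax == 0:
        return 1.0  # identical series
    coeffs = [(dmin + rho * dmax) / (d + rho * dmax) for d in deltas]
    return sum(coeffs) / len(coeffs)


if __name__ == "__main__":
    dbs = [
        [{"a", "b"}, {"a"}],
        [{"a"}, {"a", "c"}],
        [{"a", "b"}, {"c"}],
    ]
    s_a = support_series(dbs, "a")  # [1.0, 1.0, 0.5]
    print(is_stable(s_a, minsupp=0.6, varivalue=0.5))
```

A grade close to 1 means the two items' support series move together over the sale periods, which is the kind of similarity a clustering step could then group into stable patterns.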
Keywords/Search Tags: Multiple Data Sources, Quality Assessment, Label Propagation, Pattern Mining