Data Preprocessing And Pattern Mining In Multiple Data Sources

Posted on:2015-03-08

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y J Lin

Full Text:PDF

GTID:1268330428974533

Subject:Computer application technology

Abstract/Summary:

With the rapid development of database, network, and other information technologies, large and heterogeneous multiple data sources have become ubiquitous in practical applications such as sensor networks, supermarket transactions, and social media analysis. These data sources contain a wealth of useful information and valuable knowledge, but they are also heterogeneous, autonomous, complex, and inconsistent, which makes them challenging for traditional mining algorithms. Knowledge discovery from multiple data sources, including label propagation, source quality evaluation, and pattern mining, is therefore a significant problem with practical value in real-world applications. The main contributions of this dissertation are as follows.

1) Because different data sources are inconsistent with one another, it is difficult to merge them into a centralized database for learning. We present two label propagation methods that infer the labels of training objects in unlabeled sources by making full use of class label information from labeled sources and internal structure information from unlabeled sources. The two methods are referred to as global consensus and local consensus, respectively. We evaluate the classification accuracy, robustness, and scalability of the proposed methods by constructing a multiple-data-source ensemble learning model. Experimental results show that local consensus outperforms global consensus when many unlabeled sources are available.

2) Some sources may be irrelevant or redundant when constructing a multiple-data-source learning model, so it is worthwhile to select a set of good information sources that can improve learning performance. We present a source assessment and selection algorithm based on max-significance-min-redundancy, in which significance represents the degree to which an information source contributes to classification, and redundancy represents the information overlap among different information sources. We then select the top p percent of sources to construct the multiple-data-source ensemble learning model. Experimental results show that the metric effectively selects sources related to the target mining task.

3) Every time a customer interacts with a business, there is an opportunity to gain strategic knowledge. Transactional data collected over time contain a wealth of information about customers and their purchasing patterns. We divide transactional data into multiple time-stamped databases according to their sale periods, and present an efficient algorithm for mining four kinds of patterns, typified by stable patterns. First, we define the notion of stable items according to two constraints, minsupp and varivalue. We then measure the similarity between stable items using gray relational analysis, and propose a hierarchical gray clustering method for mining stable patterns that consist of stable items. Experimental results show that the proposed algorithm is effective, efficient, and scalable.
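As a rough illustration of the stable-item step in contribution 3, the sketch below filters items by two thresholds and then scores their similarity with a gray relational grade. The abstract does not define minsupp or varivalue precisely, so two assumptions are made here: an item is treated as stable when its support is at least minsupp in every time-stamped database, and when the population standard deviation of its supports is at most varivalue. The similarity uses the commonly cited form of the gray relational coefficient with a distinguishing coefficient rho of 0.5, which may differ from the dissertation's formulation. The function names stable_items and gray_relational_grades are hypothetical.

```python
from statistics import mean, pstdev


def stable_items(supports, minsupp, varivalue):
    """Keep items whose support series passes both thresholds.

    supports maps an item to its support in each time-stamped database.
    ASSUMPTION: "stable" means support >= minsupp in every period and
    population standard deviation of the supports <= varivalue.
    """
    return {
        item: series
        for item, series in supports.items()
        if min(series) >= minsupp and pstdev(series) <= varivalue
    }


def gray_relational_grades(reference, others, rho=0.5):
    """Gray relational grade of each series in `others` against `reference`.

    Uses the widely used coefficient
        (d_min + rho * d_max) / (d + rho * d_max),
    where d is the absolute difference at one time point and d_min, d_max
    are taken over all compared series. The grade is the mean coefficient.
    """
    deltas = {
        name: [abs(r - o) for r, o in zip(reference, series)]
        for name, series in others.items()
    }
    all_deltas = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:
        # Every compared series is identical to the reference.
        return {name: 1.0 for name in deltas}
    return {
        name: mean([(d_min + rho * d_max) / (d + rho * d_max) for d in ds])
        for name, ds in deltas.items()
    }


if __name__ == "__main__":
    # Made-up supports over three sale periods, for illustration only.
    supports = {
        "milk": [0.30, 0.32, 0.31],
        "bread": [0.29, 0.33, 0.30],
        "umbrella": [0.05, 0.40, 0.10],
    }
    stable = stable_items(supports, minsupp=0.2, varivalue=0.05)
    reference = stable["milk"]
    others = {k: v for k, v in stable.items() if k != "milk"}
    print(sorted(stable))
    print(gray_relational_grades(reference, others))
```

A hierarchical clustering step could then merge the stable items with the highest pairwise grades, but the merging criterion is not described in the abstract and is left out of this sketch.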

Keywords/Search Tags:

Multiple Data Sources, Quality Assessment, Label Propagation, Pattern Mining

Related items

1	Data Quality Assessment Model And Quality Propagation For Relational Database
2	Study On Data Quality Assessment Techniques For Telecom Data Mining
3	The Construction Of Online Review Knowledge Map Based On Multiple Data Sources
4	Research And Implementation Of Self-adaptive Label Propagation Algorithm
5	Research And Application Of Controllable Query For Multiple Data Sources
6	Research On No-reference Image Quality Assessment Algorithm Based On Multiple Annotators
7	The Study Of Robust Semi-Supervised Classification Algorithm Based On Label Prediction And Propagation
8	Study Of Label Propagation Clustering Algorithm Based On Data Features
9	Research And Implementation Of Subject-Oriented Structured Data Integration On Multiple Web Sources
10	Underwater Image Quality Evaluation Database Based On Preference Label