Font Size: a A A

Shared Emerging Sequences Mining Algorithm And Applications

Posted on:2015-05-09Degree:MasterType:Thesis
Country:ChinaCandidate:J WangFull Text:PDF
GTID:2428330488499548Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Emerging sequences(ESs)aims at discovering sequences that are frequent in sequences of one group but less frequent in sequences of another in the database,and thus can distinguish or contrast sequences of different classes efficiently.However,the existing algorithms of ESs mining mostly focus on a single dataset,and situations of two or more datasets are not considered.In this paper,a new kind of ESs,named as shared emerging sequences(SESs)is introduced,which stands for ESs that shared by two or more datasets.SESs can represent some common features between datasets,and have great potential in transfer learning and analogy.Focusing on the mining algorithm and applications of SESs,the main contents and contributions of this paper are as following:(1)Directing at the SESs mining problem,a framework to min e SESs is put forward.In the framework,a shared generalized suffix-tree(Shared GS-Tree)is used to store the data.The tree has a feature that mining ESs of multiple datasets and multiple classes in just one tree,which can simplify the mining process,reduce the complexity of mining space,and improve the run-time efficiency.On this basis,a Shared GS-Tree based SESs mining algorithm is proposed.The algorithm uses a depth-first search strategy to obtain ESs of each dataset,and then produces the final SESs through similarity matching.To improve the performance of the algorithm,three pruning strategies are adopted,which include length threshold pruning,max-prefix infrequency pruning,and similarity match length difference pruning.Experimental results show that combined with three pruning strategies,SESs mining algorithm achieves good time performance.(2)SESs can transfer knowledge of similar domains,but how to measure the similarity of two domain datasets? So an aggregated SESs based similarity measure strategy is introduced to calculate the similarity of two datasets.Definition of the quality of SESs is given first,then combining the average quality and quantity to calculate the SESs contribution scores of datasets,last aggregating the contribution of SESs to measure the similarity of datasets.The classification experimental results show that the measure strategy is effective.So when insufficient training data occurs,to improve the classification accuracy,we can select a similar dataset by aggregated SESs to serve as as auxiliary dataset.(3)Because there exist great relevance between negative transfer and similarities of datasets,and SESs can be used to measure the similarity between datasets,we utilize SESs to analyze negative transfer across domain datasets.When the target dataset is novel with little known labeled data,we can consider using SESs to co-classification,and experimental results demonstrate that co-classification can improve the classification accuracy.
Keywords/Search Tags:Data Mining, Emerging Sequences, Shared Emerging Sequences, Shared Knowledge, Similarity Measure
PDF Full Text Request
Related items