
Sequential Data Analysis Based On Multiple Hypothesis Testing

Posted on: 2020-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: S M Zhang
Full Text: PDF
GTID: 2370330590496786
Subject: Software engineering
Abstract/Summary:
A sequence is a common data structure. Sequential data consist of a set of sequences, where each sequence is an ordered list of elements. Sequential data arise in many real-life applications, for example monitoring blood pressure and predicting click sequences on online shopping links. Research topics in sequence analysis include classification, clustering, and pattern discovery. This thesis focuses on discriminative pattern mining and sequence classification.

Existing algorithms for both discriminative sequential pattern mining and sequence classification lack quality control over their results, which leads to low-quality mined patterns and low classification accuracy. When the result set contains too many false positives, the patterns become harder for users to apply, and the accuracy and performance of a classifier are directly affected.

To mine high-quality discriminative sequential patterns, we formulate a new problem, significance-based discriminative pattern mining, which aims to find a set of discriminative sequential patterns under the framework of multiple hypothesis testing. We also propose a corresponding algorithm, DSPM-MTC (Discriminative Sequential Pattern Mining with Multiple Testing Correction). DSPM-MTC employs the Bonferroni correction to control the family-wise error rate (FWER) and the Benjamini-Hochberg (BH) method to control the false discovery rate (FDR), which yields a stable, high-quality result set. The experimental results show that DSPM-MTC filters out a large number of false positives and provides a high-quality pattern set.

For sequence classification, we propose a new algorithm, MTC-Sclassifier (Multiple Testing Correction based Sequential Classifier). MTC-Sclassifier transforms the classification problem into a hypothesis testing problem and quantifies statistical significance in terms of p-values, which makes it possible to control the error rate of the classification results. It uses two-sample testing to assess whether an unclassified sequence belongs to the positive set or the negative set. To avoid the effect of outliers and irrelevant training data, the hypothesis test is conducted on two samples derived from the k nearest neighbors (k-NNs) of the test sequence. By utilizing FDR, MTC-Sclassifier controls the number of misclassified sequences and identifies outliers, while achieving satisfactory classification accuracy.
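The two corrections named in the abstract can be illustrated with a minimal sketch. This is not the DSPM-MTC implementation: the abstract does not say how per-pattern p-values are computed, so the p-values below are invented for illustration, and the function names `bonferroni` and `benjamini_hochberg` are hypothetical. The sketch only shows the standard Bonferroni rule (reject when p <= alpha/m) and the standard BH step-up rule (reject the k smallest p-values, where k is the largest rank with p_(k) <= k*alpha/m).

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Bonferroni rule (controls FWER)."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the BH step-up rule (controls FDR)."""
    m = len(p_values)
    # Sort hypothesis indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is under its threshold k * alpha / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    # Reject every hypothesis ranked at or below k.
    return sorted(order[:k])


# Illustrative p-values for six candidate patterns (assumed, not from the thesis).
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]
print("Bonferroni keeps patterns:", bonferroni(p_values))
print("BH keeps patterns:", benjamini_hochberg(p_values))
```

On these made-up values, BH retains one more pattern than Bonferroni, which reflects the usual trade-off: FWER control is stricter, while FDR control tolerates a bounded expected share of false positives in exchange for more discoveries.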
Keywords/Search Tags: Sequential pattern, Discriminative pattern, Multiple hypothesis testing, Sequence classification