A Study On Markov-Model-based Sequence Classification

Posted on: 2015-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: H H Wu
Full Text: PDF
GTID: 2298330467461804
Subject: Computer application technology
Abstract/Summary:
Classification is a supervised machine learning task that is widely used in fields such as bank risk assessment, customer categorization, and automatic document classification. As technology develops, classification applications increasingly involve event sequence data, which is a kind of non-numerical data. Examples include protein sequences, logs of customers' purchases at a mall, and users' website click-streams. Because event sequences are so common, classifying them quickly and accurately with model-based methods is highly important.

The characteristics of event sequences differ considerably from those of traditional numerical data. An event sequence is made up of discrete symbols, so the distance metrics commonly used for numerical data cannot be applied to it. Moreover, a sequence is an ordered list of events, so feature extraction followed by a conventional classification algorithm loses information. Because of these characteristics, many classification algorithms that perform well on numerical data do not achieve good results on event sequences.

To address these problems, this thesis proposes several new Markov models for event sequence classification, based on statistical models of event sequences. We also implement a distributed Markov model algorithm on Apache Hadoop for big data. The research in this thesis has both theoretical and practical significance.

The main contributions can be summarized as follows:

1. A new weighted variable-length Markov model is proposed, in which the probabilities of subsequences and the transition probabilities of sequence elements are combined to optimize the classification model. A new similarity-based pruning strategy, applied while the model is being built, is also proposed; it improves the generalization of the model.
2. For practical applications with only a small amount of training data, an automatically weighted variable-length Markov model based on a kernel smoothing method for nominal attributes is proposed. It aims at estimates with an optimal balance between sample bias and variance, which improves classification accuracy on small data sets.
3. To handle big data, the variable-length Markov model is ported to the Apache Hadoop distributed platform for distributed parallel computation, with the aim of overcoming the storage limits of a stand-alone machine and the bottleneck in large-scale computing capacity.
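As a rough illustration of the general idea behind Markov-model-based sequence classification, the sketch below trains one first-order Markov chain per class and assigns a new sequence to the class under which it is most probable. This is not the weighted variable-length model proposed in the thesis: the fixed order of one, the add-alpha smoothing, and all names (`MarkovSequenceClassifier`, `alpha`, the toy labels `alt` and `pair`) are assumptions made only for this example.

```python
import math
from collections import defaultdict


class MarkovSequenceClassifier:
    """One first-order Markov chain per class, with add-alpha smoothing.

    Illustrative only: a fixed-order stand-in, not the thesis's
    weighted variable-length Markov model.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha    # smoothing constant (assumed value)
        self.alphabet = set() # all symbols seen during training
        self.start = {}       # label -> {first symbol: count}
        self.trans = {}       # label -> {prev symbol: {next symbol: count}}
        self.n_seqs = {}      # label -> number of training sequences

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            if not seq:
                continue
            start = self.start.setdefault(label, defaultdict(int))
            trans = self.trans.setdefault(
                label, defaultdict(lambda: defaultdict(int)))
            self.n_seqs[label] = self.n_seqs.get(label, 0) + 1
            self.alphabet.update(seq)
            start[seq[0]] += 1
            for prev, nxt in zip(seq, seq[1:]):
                trans[prev][nxt] += 1
        return self

    def log_likelihood(self, seq, label):
        if not seq:
            return 0.0
        v = len(self.alphabet) + 1  # one extra slot for unseen symbols
        start, trans = self.start[label], self.trans[label]
        ll = math.log((start.get(seq[0], 0) + self.alpha)
                      / (self.n_seqs[label] + self.alpha * v))
        for prev, nxt in zip(seq, seq[1:]):
            row = trans.get(prev, {})
            ll += math.log((row.get(nxt, 0) + self.alpha)
                           / (sum(row.values()) + self.alpha * v))
        return ll

    def predict(self, seq):
        total = sum(self.n_seqs.values())
        # Class prior plus sequence log-likelihood; pick the best class.
        return max(
            self.n_seqs,
            key=lambda c: math.log(self.n_seqs[c] / total)
            + self.log_likelihood(seq, c),
        )


# Toy data: "alt" sequences alternate symbols, "pair" sequences repeat them.
train = ["ABABAB", "BABABA", "AABBAABB", "BBAABBAA"]
labels = ["alt", "alt", "pair", "pair"]
clf = MarkovSequenceClassifier().fit(train, labels)
print(clf.predict("ABABABAB"))
print(clf.predict("AABBAA"))
```

Because the model scores the ordered transitions directly, no fixed-length feature vector or numerical distance is needed, which is the property the abstract highlights for discrete event sequences.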
Keywords/Search Tags: event sequence, classification, Markov model, weighted, distributed computing