
A Study On HMM Based Representation Learning For Symbolic Sequences

Posted on: 2022-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: B X Chen
GTID: 2518306752969199
Subject: Computer application technology
Abstract/Summary:
With the advancement of science and technology, increasingly complex types of data are appearing in the real world, including symbolic sequences. Although data mining has made considerable progress over the past few decades, most algorithms are still limited to the analysis and mining of numerical data. Extending data mining algorithms to symbolic sequences therefore remains a problem worth studying. A symbolic sequence is an ordered list of events: a special kind of symbolic data composed of a finite number of discrete symbols arranged in a temporal or spatial order. The volume of symbolic sequence data is growing day by day, and such data are widely used in fields such as science, business, medicine, and security. Because symbolic sequences are unstructured and vary in length, classical data mining algorithms designed for numerical data cannot be applied to them directly. Through representation learning, however, symbolic sequences can be expressed as vectors, so that algorithms such as k-Nearest Neighbors (KNN) and K-Means can be applied directly to their mining and analysis. Representation learning for symbolic sequences therefore has both theoretical significance and important application value.

This thesis studies HMM-based representation learning for symbolic sequences and addresses two key problems: feature representation and model selection. First, a new distance measure for sequences and a training method for hidden Markov models are proposed. Second, a model selection method based on hidden state clustering is proposed to estimate the number of hidden states of symbolic sequences. Finally, a probability vector representation method for symbolic sequences is proposed and applied to symbolic sequence classification with a prototype classifier. The main work and contributions of this thesis are as follows:

1. An unsupervised representation learning framework for symbolic sequences is constructed, and a two-stage pre-training method (Pre-training HMM) is proposed. Under the condition that the hidden states of the hidden Markov model are shared, each sequence obtains its own state transition matrix as its feature representation. A new sequence distance measure based on the state transition matrix of the hidden Markov model is defined for the classification of symbolic sequences.

2. To address the difficulty of determining the hidden states of symbolic sequences, a pre-training HMM method based on model selection is proposed. The hidden states of symbolic sequences are learned in an unsupervised way, and a model selection method based on state clustering (MSSC) is proposed.

3. To address the high dimensionality of symbolic sequence representations, a representation method based on probability vectors is proposed. The multi-sequence training algorithm of the hidden Markov model is used to construct probability vector representations of the symbolic sequences in each class of a dataset, and a new symbolic sequence classification method is built by combining this feature representation with a prototype classifier.
Keywords/Search Tags:Symbolic sequence, Pre-training HMM, Distance measurement of sequences, Model selection, Feature representation
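
The idea in the first contribution, giving each sequence its own state transition matrix while the hidden states are shared, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis. The shared emission matrix `B` and initial distribution `pi` are fixed by hand rather than pre-trained. The function names `fit_transition_matrix` and `transition_distance` are hypothetical. The Frobenius norm is only a placeholder for the thesis's distance measure, which the abstract does not define.

```python
import numpy as np

def fit_transition_matrix(seq, B, pi, n_iter=20, eps=1e-6):
    """Re-estimate only the transition matrix A for one sequence with
    Baum-Welch, keeping the shared emissions B and initial distribution
    pi fixed. Assumes len(seq) >= 2 and strictly positive entries in B."""
    K, T = B.shape[0], len(seq)
    A = np.full((K, K), 1.0 / K)          # start from uniform transitions
    for _ in range(n_iter):
        # Scaled forward pass: each alpha[t] is normalized to sum to 1.
        alpha = np.zeros((T, K))
        scale = np.zeros(T)
        alpha[0] = pi * B[:, seq[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, seq[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]
        # Scaled backward pass, using the same scaling factors.
        beta = np.ones((T, K))
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, seq[t + 1]] * beta[t + 1])) / scale[t + 1]
        # Expected transition counts, summed over time.
        xi_sum = np.zeros((K, K))
        for t in range(T - 1):
            future = B[:, seq[t + 1]] * beta[t + 1]
            xi_sum += np.outer(alpha[t], future) * A / scale[t + 1]
        xi_sum += eps                      # light smoothing (assumption)
        A = xi_sum / xi_sum.sum(axis=1, keepdims=True)
    return A

def transition_distance(A1, A2):
    """Placeholder distance: Frobenius norm of the difference."""
    return float(np.linalg.norm(A1 - A2))

# Hypothetical shared model: 2 hidden states, 2 symbols.
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])

seq_a = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]   # long runs of one symbol
seq_b = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]   # strictly alternating

A_a = fit_transition_matrix(seq_a, B, pi)
A_b = fit_transition_matrix(seq_b, B, pi)
d_ab = transition_distance(A_a, A_b)
```

The two sequences differ in length-independent structure (runs versus alternation), and that difference shows up in their fixed-size transition matrices. Such a representation could then be passed to a distance-based classifier like KNN.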