
Using Regression Methods To Estimate The Length Of The Longest Frequent Patterns

Posted on: 2016-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: H X Zhou
Full Text: PDF
GTID: 2308330479999193
Subject: Computer Science and Technology
Abstract/Summary:
Data mining is the nontrivial process of extracting valid, novel, potentially useful, credible and ultimately understandable patterns from large amounts of data. Sequential pattern mining, an important branch of data mining research, is used to discover rules and extract valuable knowledge hidden in large amounts of data from various application fields. Mining frequent patterns with periodic wildcard gaps is a form of sequential pattern mining under wildcard gap constraints: a wildcard gap must lie between every two adjacent items of a pattern, and all gaps share the same user-specified size range. Such a pattern can be written as a1[M,N]a2[M,N]a3[M,N]…am-1[M,N]am, where M and N are the minimum and maximum gap sizes, respectively. When mining sequential patterns with periodic wildcard gaps in DNA sequences, an important task is predicting the length of the longest frequent patterns, a value that most existing algorithms for this problem rely on as an estimate. However, there is no effective method for calculating it, so it is usually set from experience. This thesis therefore studies that problem.

A regression approach is adopted, and the work proceeds in three steps. The first step is obtaining the regression target: sequential pattern mining algorithms with periodic gap constraints are run on sets of DNA sequences, and the length of the longest frequent patterns is recorded for each combination of gap and threshold. The second step is feature selection: the frequencies of the length-2 patterns in the DNA sequences form the first 16 dimensions of each data set, the 17th dimension is the mining threshold, and the 18th dimension is the length of the longest frequent patterns.
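
The 18-dimensional sample layout described in the second step can be sketched as follows. This is a minimal illustration, not the thesis's own code: it assumes the "length 2 patterns" are adjacent nucleotide pairs over the alphabet A, C, G, T (4 x 4 = 16 patterns) and that frequencies are normalised by the number of valid pairs. The abstract does not state either detail, and the function names `dinucleotide_frequencies` and `build_sample` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
# The 16 length-2 patterns in a fixed order: AA, AC, AG, AT, CA, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product(ALPHABET, repeat=2)]


def dinucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 16 adjacent pairs.

    Assumption: pairs are adjacent characters (no wildcard gap), and pairs
    containing characters outside ACGT (for example N) are skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[pair] / total for pair in DINUCLEOTIDES]


def build_sample(sequence, threshold, longest_length):
    """Assemble one 18-dimensional row.

    Dimensions 1-16: length-2 pattern frequencies.
    Dimension 17:    mining threshold.
    Dimension 18:    length of the longest frequent patterns (regression target).
    """
    features = dinucleotide_frequencies(sequence)
    return features + [float(threshold), float(longest_length)]


if __name__ == "__main__":
    # Illustrative values only; the threshold and target here are made up.
    row = build_sample("ACGTACGTTGCA", threshold=0.001, longest_length=12)
    print(len(row), row)
```

In this layout the target value in the last position would come from the mining runs of the first step, one row per combination of sequence, gap and threshold.
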
The last step is building the learning machine by regression. The training sets and testing sets are obtained from the first two steps. A BP network, Least Squares Support Vector Machines (LS-SVM) and an Extreme Learning Machine (ELM) are trained on the training sets, and the testing sets are then used to evaluate what each model has learned.

Finally, two groups of experiments are designed to regress the length of the longest frequent patterns: one varies thresholds and gaps, and the other varies thresholds and sequences. The experimental results show that ELM has better generalization performance, especially when the threshold and the sequence are changed.
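
As a rough illustration of the third step, the sketch below fits a basic Extreme Learning Machine regressor: hidden-layer weights are drawn at random and left fixed, and only the output weights are solved by least squares. It assumes NumPy is available, uses a tanh activation and a hidden-layer size chosen arbitrarily, and trains on synthetic data whose 17 input columns merely mimic the layout described above. The class `ELMRegressor`, the helpers `make_synthetic_data` and `rmse`, and the made-up target function are all hypothetical and say nothing about the thesis's actual configuration or results.

```python
import numpy as np


class ELMRegressor:
    """Basic single-hidden-layer ELM for regression (illustrative sketch)."""

    def __init__(self, n_hidden=30, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Hidden-layer output H = tanh(XW + b); W and b are never trained.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights: least-squares solution via the Moore-Penrose pseudoinverse.
        self.beta = np.linalg.pinv(H) @ y
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta


def make_synthetic_data(n_samples=200, seed=1):
    """Synthetic stand-in for the data set: 16 frequencies plus one threshold.

    The target is an arbitrary made-up function plus noise, not real mining output.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.dirichlet(np.ones(16), size=n_samples)      # each row sums to 1
    threshold = rng.uniform(0.01, 0.1, size=(n_samples, 1))
    X = np.hstack([freqs, threshold])
    noise = rng.normal(scale=0.1, size=n_samples)
    y = 5.0 + 20.0 * freqs[:, 0] - 30.0 * threshold[:, 0] + noise
    return X, y


def rmse(y_true, y_pred):
    """Root-mean-square error between two 1-D arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


if __name__ == "__main__":
    X, y = make_synthetic_data()
    X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]
    model = ELMRegressor(n_hidden=30, seed=0).fit(X_train, y_train)
    print("train RMSE:", rmse(y_train, model.predict(X_train)))
    print("test RMSE:", rmse(y_test, model.predict(X_test)))
```

Comparing the training and testing errors in this way is one simple view of generalization; a BP network or LS-SVM could be evaluated on the same split for comparison.
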
Keywords/Search Tags: Sequential pattern mining, gap, the greatest length of frequent patterns, BP-network, LS-SVM, ELM