Using Regression Methods To Estimate The Length Of The Longest Frequent Patterns

Posted on:2016-03-07

Degree:Master

Type:Thesis

Country:China

Candidate:H X Zhou

GTID:2308330479999193

Subject:Computer Science and Technology

Abstract/Summary:

Data mining is the nontrivial process of extracting valid, novel, potentially useful, credible, and ultimately understandable patterns from large amounts of data. Sequential pattern mining, an important branch of data mining research, is used to find rules and extract valuable knowledge hidden in large amounts of data from various application fields. Mining frequent patterns with periodic wildcard gaps is a kind of sequential pattern mining with wildcard gap constraints. It requires a wildcard gap between every pair of adjacent items in a pattern, and all gaps in a pattern share the same user-specified size range. Such a pattern has the form a_1[M,N]a_2[M,N]a_3[M,N]…a_{m-1}[M,N]a_m, where M and N are the minimum and maximum gap sizes, respectively. When mining sequential patterns with periodic wildcard gaps on DNA sequences, an important task is to predict the length of the longest frequent patterns, a value that most existing algorithms for this problem rely on as an estimate. However, there is no effective method for calculating it, and it is usually set from experience. This thesis therefore studies that problem.

The thesis adopts a regression approach and proceeds in three steps. The first step obtains the regression target: sequential pattern mining algorithms with periodic gap constraints are run on sets of DNA sequences, and the length of the longest frequent patterns is recorded for each combination of gap and threshold. The second step is feature selection: the frequencies of the length-2 patterns in the DNA sequences form the first 16 dimensions of the data set, the 17th dimension is the mining threshold, and the 18th dimension is the length of the longest frequent patterns.
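The feature construction in the second step can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the thesis's implementation: it assumes the alphabet is A, C, G, T, that a "length-2 pattern" is an adjacent pair of bases, and that "frequency" means the pair's share of all adjacent pairs. The abstract does not say whether gaps are allowed inside these length-2 patterns. The names `pair_frequencies` and `feature_vector` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
# The 16 possible length-2 patterns over the DNA alphabet: AA, AC, ..., TT.
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]


def pair_frequencies(seq):
    """Relative frequency of each adjacent base pair in seq (assumed definition)."""
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing symbols outside ACGT
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def feature_vector(seq, threshold):
    """17 input features: 16 pair frequencies followed by the mining threshold."""
    return pair_frequencies(seq) + [threshold]
```

A training sample would then pair `feature_vector(seq, threshold)` with the mined length of the longest frequent pattern as the 18th value, which serves as the regression target.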
The last step is building the learning machine by regression. The training sets and test sets are obtained from the first two steps. A BP network, Least Squares Support Vector Machines (LS-SVM), and Extreme Learning Machine (ELM) are trained on the training sets, and the test sets are then used to evaluate what they have learned.

Finally, two groups of experiments are designed for regressing the length of the longest frequent patterns: one varies the thresholds and gaps, and the other varies the thresholds and sequences. The experimental results show that ELM has better generalization performance, especially when the threshold and sequence vary.
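As a rough illustration of the ELM regressor mentioned above, the sketch below follows the generic ELM recipe: the hidden-layer weights are drawn at random and left untrained, and only the output weights are fitted, here by a least-squares solve using the Moore-Penrose pseudoinverse. The thesis's actual configuration is not given in the abstract, so the tanh activation, the hidden-layer size, and the names `train_elm` and `predict_elm` are all assumptions.

```python
import numpy as np


def train_elm(X, y, n_hidden=50, seed=0):
    """Fit a single-hidden-layer ELM regressor (generic recipe, assumed settings)."""
    rng = np.random.default_rng(seed)
    # Random input weights and biases; these are never updated.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)  # hidden-layer output matrix
    # Output weights: minimum-norm least-squares solution of H @ beta = y.
    beta = np.linalg.pinv(H) @ y
    return W, b, beta


def predict_elm(model, X):
    """Predict targets for the rows of X with a model returned by train_elm."""
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta
```

Here `X` would hold the 17 input features of each sample as a row and `y` the corresponding lengths of the longest frequent patterns. Predictions are real-valued, so rounding them to integers is a separate design choice that the abstract does not address.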

Keywords/Search Tags:

Sequential pattern mining, gap, length of the longest frequent patterns, BP network, LS-SVM, ELM

Related items

1	Study On Frequent Pattern Mining Algorithms And Pruning Strategies
2	The Research On Key Problems Of Sequential Patterns Mining
3	Research On Application Of Sequential Mining In Discovery Of Clinical Behavior Patterns
4	Research On Mining And Querying Frequent Patterns Based On Simplified Frequent Pattern Tree
5	Research On Mining Algorithm Of Web Log Frequent Sequential Patterns
6	Mining frequent sequential patterns in data streams using SSM-algorithm
7	Research On Key Techniques Of Negative Frequent Patterns Mining Based On Multiple Minimum Supports
8	Mining Condensed Sets Of Sequential Patterns And Structured Patterns
9	The Techniques Research On Frequent Pattern Mining
10	Mining Closed Sequential Patterns Based On Bitmap