A Frequent String Mining Algorithm Based On Optimized LCP Table

Posted on:2012-09-17

Degree:Master

Type:Thesis

Country:China

Candidate:Z X Guo

Full Text:PDF

GTID:2178330335970091

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

The frequent string mining is an important data mining problem, which is an important branch of text mining. The purpose of the frequent string mining is to find all strings that fulfill certain constraints of all string databases. It has attracted considerable attention from Biomedical and Information System researchers because of its broad applications in many areas such as analysis of genome on Biomedical, Information Retrieval, analysis and decision etc.Most of the current algorithms for frequent string mining are based on the data structure of suffix array and LCP table, which have several steps as follow:Firstly, connect all strings of all string databases as a connection string, these strings are split by a special symbol which is not used. Secondly, calculate the suffix array or suffix tree in accordance with the connection string. Finally, calculate the longer common prefix of all suffixes. We can use this data structure get all strings that fulfill certain constraints of all string databases. On the step of calculate suffix array or suffix tree, suffix array or suffix tree is calculated base on the connection string. The length of suffix array is long because of the length of the connection string is big. With the increased of the string database and strings in the string database, the connection string will bigger. And the suffix array will need more space. Otherwise, these algorithms will be generated some unnecessary suffix which contains division marks. These unnecessary suffixes make the average length of the suffix array and LCP table longer. It will need more space and time to deal with it.In this paper, based on the analysis of the existing algorithm-Slink_merge, we presented an algorithm-Slink_merge+ to calculate the suffix array and LCP table. We use start array and end array to store the position of the first and the last character in the connection string. The suffix of one string can be obtained by suffix array and the end array. All the suffix of all strings in string database is generated by each string. The space consumption can be reduced by shortened the length of the suffixes. In this paper, the division mark is not contained in suffix. So we needn't to remove the relevant substrings which contain the division mark, the process time can be saved. The experimental result shows that Slink_merge+ get great benefits from Slink_merge, it can greatly reduce the total space consumption and process time efficiently.

Keywords/Search Tags:

frequent string mining, substring, suffix, longer common prefix, suffix array, frequency, LCP table

PDF Full Text Request

Related items

1	Research On Construction Of Index Structure For Biological Sequences
2	Approximate String Matching Based On Sufifx Array
3	Finding MUMs With Enhanced Suffix Arrays
4	Approximate Longest Common Substring Matching And Optimization Techniques With Edit Distance Constraint
5	The String Pattern Searching Algorithms Based On Suffix Arrays
6	Multi-pattern Matching With Wildcards Based On Suffix Tree And Suffix Array
7	Research Of Finding Maximal Unique Matches In Genome
8	Frequent Sequence Pattern Mining In Web Log
9	Engineering Implementation And Performance Evaluation Of An External Memory Suffix Array Construction
10	Code Plagiarism Detection Research Based On Suffix Tree