Font Size: a A A

A Frequent String Mining Algorithm Based On Optimized LCP Table

Posted on:2012-09-17Degree:MasterType:Thesis
Country:ChinaCandidate:Z X GuoFull Text:PDF
GTID:2178330335970091Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The frequent string mining is an important data mining problem, which is an important branch of text mining. The purpose of the frequent string mining is to find all strings that fulfill certain constraints of all string databases. It has attracted considerable attention from Biomedical and Information System researchers because of its broad applications in many areas such as analysis of genome on Biomedical, Information Retrieval, analysis and decision etc.Most of the current algorithms for frequent string mining are based on the data structure of suffix array and LCP table, which have several steps as follow:Firstly, connect all strings of all string databases as a connection string, these strings are split by a special symbol which is not used. Secondly, calculate the suffix array or suffix tree in accordance with the connection string. Finally, calculate the longer common prefix of all suffixes. We can use this data structure get all strings that fulfill certain constraints of all string databases. On the step of calculate suffix array or suffix tree, suffix array or suffix tree is calculated base on the connection string. The length of suffix array is long because of the length of the connection string is big. With the increased of the string database and strings in the string database, the connection string will bigger. And the suffix array will need more space. Otherwise, these algorithms will be generated some unnecessary suffix which contains division marks. These unnecessary suffixes make the average length of the suffix array and LCP table longer. It will need more space and time to deal with it.In this paper, based on the analysis of the existing algorithm-Slink_merge, we presented an algorithm-Slink_merge+ to calculate the suffix array and LCP table. We use start array and end array to store the position of the first and the last character in the connection string. The suffix of one string can be obtained by suffix array and the end array. All the suffix of all strings in string database is generated by each string. The space consumption can be reduced by shortened the length of the suffixes. In this paper, the division mark is not contained in suffix. So we needn't to remove the relevant substrings which contain the division mark, the process time can be saved. The experimental result shows that Slink_merge+ get great benefits from Slink_merge, it can greatly reduce the total space consumption and process time efficiently.
Keywords/Search Tags:frequent string mining, substring, suffix, longer common prefix, suffix array, frequency, LCP table
PDF Full Text Request
Related items