Font Size: a A A

Sequential Pattern Mining With Wildcards

Posted on:2012-09-28Degree:DoctorType:Dissertation
Country:ChinaCandidate:F XieFull Text:PDF
GTID:1118330371473661Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
There is a huge wealth of sequence data available in real-word applications, such as customerpurchase histories, telephone call records, DNA and protein sequences, and text data etc. It is aworthwhile goal to analyze this wealth of data and mine important knowledge. Sequential patternmining aiming at finding frequent sequential patterns from a sequence database has become animportant reseach task in the field of data mining. Although recurring patterns convey important anduseful knowledge, they rarely reproduce themselves, but rather appear with a slightly different formin each of the occurrences. There are possible flexible wildcards between any two consectivecharcters of a pattern, for example the insertions or deletions of short periods in DNA sequences.Therefore, the research on sequential pattern mining with wildcards not only has importanttheoretical value, but it can be applied in many dommains, such as text mining, bioinformatics, andsensor networks.This dissertation focuses on sequential pattern mining with wildcards and its applications in textmining. The main research work includes three aspects. Firstly, we will formally define the problemof sequential pattern mining with flexible gap constratints and the one-off condition, and designeefficient mining algorithms. Secondly, the problem of keyword extraction is also studied in thisdissertation based on semantic relations. Thirdly, the designed mining algorithm with wildcards isused for keyword extraction. Sequential patterns are used to analyze the semantic relations betweenwords and a keyword extraction model is built. The major contributions of this dissertation are asfollows:(1) The problem of sequential pattern mining with wildcards is formally defined. Given asequence S, the gap constraint specified by the user, and the minimal support threshold, our goal isto mine all frequent sequential patterns from S with the minimal support threshold and the gapconstraint. The one-off condition is also incorporated into the problem. That is, each character in thegiven sequence can be used at most once in all occurrences of a pattern. A sequential pattern miningalgorithm with wildcards, called One-off Mining, is proposed based on the breadth-first searchstrategy. The one-way scan is used to compute the support of a pattern with the gap constraint andthe one-off condition. Based on the Apriori property, the set of candidate k-sequences is generatedby the set of all frequent (k-1)-sequences. Expeirmental results show that more frequent patterns aremined by the One-off Mining algorithm, and the time performance is greatly improved.(2) An efficient pattern-growth based mining algorithm MAIL is presented that utilizes thecandidate occurrences of the prefix to obtain the candidate occurrences of the extended pattern andavoids the rescanning of the sequence. Two pruning strategies including the left-most pruning andthe right-most pruning are designed in MAIL to improve the completeness and the time efficiency. Experiments demonstrate that MAIL improves the completeness of the solution and the timeefficiency.(3) A novel data structure, named HDAG (Hierarchy Directed Acycline Graph), is proposed toobtain and store the exponential number of candidate occurrences of a pattern in a linear time andspace complexity. The occurrences that contribute to the support of the pattern are selected by thedepth-first traversal of the graph. It can be proved that all candidate occurrences of a pattern arestored in the occurrence graph, and an optimal occurrence is selected by the depth-first traversal ofthe root node. Experiments show that the pattern mining algorithm based on HDAG approximatesthe maximal support with more than90%accuracy.(4) The problem of keyword extraction is studied using sequential patterns with wildcards. Thelength of a texual sequence is usually long, and the alphabet is large. So, the MAIL algorithm isimproved to be efficient. Sequential patterns with wildcards are mined from the document sequencesthat reflect the semantic relatedness between words. The pattern features within words are used totrain the keyword extraction model. The influence of selected pattern features on the quality of thekeywords extracted is also discussed. Experimantal results show that the pattern features improvethe performance of keyword extraction, and the method is not restricted by the language and thethesaurus. A keyword extraction method for Chinese news web pages using lexical chains isproposed. Semantic relations between words are obtained by the HowNet thesaurus. The lexicalchain model is used to score candidate words and improve the cohesion of the keywords extracted.
Keywords/Search Tags:Wildcards, Sequential Pattern Mining, One-off Condition, Hierarchy DirectedAcycline Graph, Keyword Extraction
PDF Full Text Request
Related items