Font Size: a A A

Research On Related Technology Of Frequent Pattern Mining For Semi-structured Data

Posted on:2011-01-23Degree:DoctorType:Dissertation
Country:ChinaCandidate:H Q YangFull Text:PDF
GTID:1118360308457801Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Data mining has encountered many challenges. In order to mine those heterogeneous data and complex data, non-traditional data, such as the semi-structured text and hyperlinks to Web pages set, the sequence and three-dimensional structure of the DNA data, meteorological data of time series measurements, etc. , the algorithms need to consider the link relationship between data, which the self-correlation of time and space, graph connectivity, the link between the elements of semi-structured text and XML documents. Since tree and graph is the most common way of expressing entities, attributes, and the link between the entities through the nodes and edges, so they can describe the relationship between the majority of the material world, with a broad application base. At present, the mass data contains a large number of relationships with complex information, so how to mine these data also seems particularly important.In this paper, the related technologies for mining frequent patterns in structured data were studied. Focuses on the following issues: semi-structured data and constraint-based frequent pattern of semi-structured data; partially labeled graph data frequent pattern mining and the method base on weakening of support for constraint.Firstly, a new algorithm XMLMiner has been proposed based on the smallest subtree coding in general generalization, which calculates the least general generalization to find frequent subtree. The algorithm is directly constructed without multiple scans. Since adoption of least general generalization, it avoids tree matching operation. Compared with the traditional algorithm, it has higher efficiency.Secondly, we presented an algorithm for mining frequent embedded subtree, EXMLMiner. Algorithm is based on growth of frequent subtree coding sequence, through the path intersection operations to generate frequent items and then frequent subtree structure is compressed, reduced to a true subtree. Algorithm is based on the rightmost path of expansion in order to find all frequent pattern trees, since only the rightmost branch of the tree by adding new nodes to generate a new tree. The experimental results show that the algorithm is efficient.Thirdly, we presented an effective algorithm for mining frequent partially labeled subgraphs, PLSM. It directed to construction frequent patterns without generating candidate set and test operations; using depth-first, the most right-mode growth strategy to mine only in part of the labeled graphs, avoiding the dynamic tree construction. We used different filtering techniques to narrow the search space. Experimental and theoretical analysis shows that, the algorithm has a better time and space efficiency.Forthly, this paper presents a constraint-based frequent subtree mining algorithm, CTreeMiner. Based on the given constraint definition, pruning data through constraint simplicity in the pre-processing. Then, we set the weighted support based on the item constraint principle and checked if meets the constraint requirements according to the weighted support and frequency. It pruned the search space to reduce matching cost under the monotone, anti-monotone properties. Experiments showed that the algorithm has better time and space efficiency.Fifthly, this paper presents an algorithm based on support constraints of pattern weakening and approximate maximum independent set of frequent subgraph mining methods. The idea is to put support constrained forward to the mining process in order to make different subgraphs to match different support in the mining results. Performance of the algorithm is tested in the multiple manual dataset, and experimental results showed that the algorithm is effective and efficient.
Keywords/Search Tags:Data Mining, Frequent Patterns, Semi-structured data, Partially labeled graph, Constraint
PDF Full Text Request
Related items