Study On Semi-structured Data Mining

Posted on: 2014-01-28  Degree: Doctor  Type: Dissertation
Country: China  Candidate: W Li  Full Text: PDF
GTID: 1228330395496903  Subject: Computer software and theory
Abstract/Summary:
With the development of computer networks, semi-structured data has come into wide use because of its hierarchical, self-describing, and dynamically changing nature. HTML documents, XML documents, SGML documents, Web data, and the data produced by integrating heterogeneous sources are all semi-structured data.

Semi-structured data is encoded differently from structured data (such as data in relational databases or Excel spreadsheets). Faced with a flood of semi-structured data, traditional data mining algorithms perform poorly because of these self-describing, dynamic, and hierarchical characteristics. Since traditional structured data processing techniques do not apply to semi-structured data, new methods for mining it need to be studied.

This paper demonstrates all of its algorithms on XML and tree data. Based on the features of semi-structured data, it studies frequent patterns, dynamic frequent patterns, clustering, dynamic XML clustering, and national fabric patterns, and proposes solutions for semi-structured data mining.

With the wide use of semi-structured data, frequent tree pattern mining has become a research hotspot. This paper proposes a compression list structure that can represent an unordered labeled tree, and then a compression tree model built on it: the label of each node of a compression tree is a compression list, so the tree can be compressed without losing information. In addition, this paper proposes CITMinerC, a closed induced frequent subtree mining method with dataset compression. CITMinerC follows the pattern-growth strategy. It first runs a cut-edge preprocessing step, then iterates over the dataset, compressing each edge into a compressed node and saving the edge's information in that node's compression list, and finally obtains the frequent subtree set from the closure of the compression lists.
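
The support-counting idea behind frequent subtree mining can be illustrated on the smallest possible pattern, a single parent-child edge. The following is a minimal sketch, not CITMinerC itself: the nested-tuple tree representation, the function names `edges` and `frequent_edges`, and the toy dataset are all assumptions made for illustration.

```python
from collections import Counter

# Assumed representation: a tree is (label, [child_tree, ...]).
# Sibling order is ignored, so the trees are treated as unordered.


def edges(tree):
    """Return the set of distinct (parent_label, child_label) edges in one tree."""
    found = set()
    stack = [tree]
    while stack:
        label, children = stack.pop()
        for child in children:
            found.add((label, child[0]))
            stack.append(child)
    return found


def frequent_edges(dataset, min_support):
    """Return the edges that occur in at least min_support trees.

    Support is counted per tree: an edge repeated inside one tree
    still contributes only 1 to its support.
    """
    counts = Counter()
    for tree in dataset:
        counts.update(edges(tree))
    return {edge: n for edge, n in counts.items() if n >= min_support}


# Hypothetical toy dataset of three unordered labeled trees.
dataset = [
    ("a", [("b", [("c", [])]), ("d", [])]),
    ("a", [("b", []), ("c", [])]),
    ("a", [("d", [("b", [("c", [])])])]),
]
result = frequent_edges(dataset, min_support=2)
```

With a support threshold of 2, only the edges shared by at least two of the three trees survive. A pattern-growth miner would extend such frequent edges into larger subtrees; that step is omitted here.
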
The advantage of CITMinerC is that its computation cost grows linearly as the frequency threshold decreases. Experimental results show that CITMinerC outperforms the DryadeParent algorithm on both artificial and real datasets.

In practice, XML data changes frequently. To exploit this dynamic characteristic, this paper proposes a method for mining SFCS (Spatial Frequently Changing Structure) from the historical structural changes of XML data. First, we propose a way to measure the space of an XML substructure, measure how often that space changes using the structural spatial changing degree (SSCD), version spatial changing degree (VSCD), and spatial changing degree (SCD), and give the definition of SFCS. Next, we propose a data model called SC-DOM, which stores XML change information and is used to discover SFCS; we show how editing operations affect the space of XML substructures and define the maintenance method for SC-DOM. Finally, we propose an algorithm for mining SFCS based on SC-DOM and discuss its complexity. Experimental results show that SFCS occur frequently and that mining SFCS based on SC-DOM is effective and scalable.

The processing and management of semi-structured data is a hot research topic. However, previous similarity detection methods for XML ignored the layer information of XML, treating the layer as having no effect on the similarity of XML data. This paper proposes CXLI, a layer-sensitive clustering method for collections of XML documents. We first propose a structure table to eliminate duplicate structures, and then propose constraints on editing operations. Finally, we cluster XML documents using agglomerative hierarchical clustering.
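
Agglomerative hierarchical clustering, the last step named above, can be sketched over a precomputed distance matrix. This is a generic illustration under stated assumptions, not the CXLI method: the average-linkage rule, the function name `agglomerative`, and the toy distances are hypothetical, and the layer-sensitive distance that would fill the matrix is not reproduced here.

```python
def agglomerative(dist, k):
    """Merge clusters by average linkage until k clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances
    between documents. Returns a sorted list of sorted index lists.
    """
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(k, 1):
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = sorted(clusters[x] + clusters[y])
        # Replace the two closest clusters with their union.
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical distances between four documents: 0 and 1 are close,
# 2 and 3 are close, and the two pairs are far apart.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
groups = agglomerative(dist, k=2)
```

The quadratic search for the closest pair keeps the sketch short; any distance function between XML documents can be plugged in by filling `dist` beforehand.
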
Experiments on the ACM SIGMOD dataset and a synthetic dataset show that CXLI achieves better precision at a similar time cost.

In practical applications, some structures of an XML document change often. To mine the knowledge hidden in the frequently changing structures of an XML document's change history, this paper proposes a method for finding those structures, represents each XML document with a document-vector model composed of a set of frequently changing structures, uses the proportion in which each frequently changing structure appears in a cluster as its weight, and clusters XML documents using weighted cosine similarity. Experimental analysis shows that XML documents cluster better when represented by the frequently changing structures found in their change history. Clustering with weighted cosine similarity gives better precision, recall, and intra-cluster distance than clustering with non-weighted cosine similarity, so weighted cosine similarity is a valid basis for clustering XML documents.

The fabric patterns of Chinese ethnic minorities carry a great deal of cultural information. Currently, almost all fabric patterns are stored in bitmap format, which makes accurate data mining difficult. This paper first discusses the shape characteristics of national fabric patterns and proposes a vector gene model that carries cultural information. It then discusses the composition of the gene structure of national fabric patterns and proposes a national fabric pattern model based on the semi-structured data model; the model describes the relationships among the shape information, cultural information, and gene information of a national fabric pattern. The experiments show that the proposed model can correctly describe example Kazak, Kirgiz, Mongolian, and Uygur fabric patterns and supports data mining.
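
The weighted cosine similarity used for clustering XML documents by their frequently changing structures can be sketched as follows. This is a minimal illustration of one common weighted form, in which each vector dimension is scaled by a non-negative weight. The function name `weighted_cosine`, the binary document vectors, and the weight values are assumptions, not the dissertation's data or exact formula.

```python
import math


def weighted_cosine(u, v, w):
    """Cosine similarity of u and v with per-dimension non-negative weights w."""
    dot = sum(wi * a * b for wi, a, b in zip(w, u, v))
    norm_u = math.sqrt(sum(wi * a * a for wi, a in zip(w, u)))
    norm_v = math.sqrt(sum(wi * b * b for wi, b in zip(w, v)))
    if norm_u == 0 or norm_v == 0:
        # A vector with no weighted mass is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vectors: each dimension marks whether a document
# contains a given frequently changing structure.
doc_a = [1, 1, 0]
doc_b = [1, 0, 1]

# Hypothetical weights: the first structure is assumed to be far more
# representative of the cluster than the other two.
weights = [0.8, 0.1, 0.1]

plain = weighted_cosine(doc_a, doc_b, [1, 1, 1])   # non-weighted baseline
weighted = weighted_cosine(doc_a, doc_b, weights)
```

Because the two documents agree on the heavily weighted structure and differ only on lightly weighted ones, the weighted score is higher than the non-weighted baseline.
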
Keywords/Search Tags: Data mining, Semi-structured data, Frequent pattern, Dynamic frequent pattern, Document clustering, National fabric model