
Research Of Schema Extraction Algorithm Of Semi-structured Data Based On OEM Model

Posted on: 2012-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: X W Yang
Full Text: PDF
GTID: 2178330338493789
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of computer, database, and Internet technology, the amount of semi-structured data and information from various fields has increased dramatically. To meet the needs of data mining, we therefore need a semi-structured data model that can describe such data and store both its structural information and its content. We also need an effective schema extraction algorithm that can derive, from a large volume of semi-structured data, a model describing the data's information, structure, and deeper potential value. Once the structure of the data and the relationships between data objects have been discovered, semi-structured data can be operated on through its structure as effectively as data in a conventional database.
First, the thesis introduces the concepts of data mining and Web data mining, and analyzes and summarizes the current state of research and its development. It describes in detail the definition and characteristics of semi-structured data and the major semi-structured data models currently in use, and it explains the concept of schema extraction for semi-structured data. The thesis uses the OEM (Object Exchange Model) to describe semi-structured data and points out that pruning based on the Apriori property does not apply to an OEM graph whose branch paths contain the same label. To reduce the number of label path expression matches and improve the efficiency of the algorithm, the thesis proposes a property of the OEM model.
To store the OEM model, the thesis uses a variant of the adjacency list, which improves the efficiency of schema extraction for semi-structured data.
The thesis then focuses on two classic frequent pattern mining algorithms, the Apriori algorithm and the FP-growth algorithm, and compares their performance. On this basis, and in order to obtain the target model of semi-structured data quickly, effectively, and accurately, the thesis combines the related properties of label paths and proposes an algorithm that extracts the target model directly from the OEM model of the semi-structured data. The basic idea of the algorithm is as follows: a depth-first search obtains all label path expressions; with the help of the property proposed in the thesis, which reduces the number of path matches, all frequent label path expressions are generated layer by layer; finally, a deletion strategy yields all of the longest frequent label path expressions. Theoretical analysis and experimental results show that the algorithm improves both the accuracy of the target model and the efficiency of extraction, and reduces the size of the candidate sets in pattern extraction.
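The three steps named in the abstract (depth-first enumeration of label paths, layer-by-layer selection of frequent paths, and deletion of non-longest paths) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not state the proposed OEM property or the exact adjacency-list variant, so the sketch makes several assumptions. Each OEM document is assumed to be a plain adjacency list (a dict from object id to labeled edges), support is assumed to be the number of documents containing a root-anchored label path, and "longest" is assumed to mean "not a proper prefix of another frequent path". The names `DOCS`, `label_paths`, `frequent_paths`, and `longest_paths` are hypothetical.

```python
from collections import Counter

# Hypothetical OEM documents stored as adjacency lists: each maps an object id
# to a list of (edge label, child object id) pairs. Object 0 is assumed to be
# the root of every document.
DOCS = [
    {0: [("book", 1)], 1: [("title", 2), ("author", 3)], 3: [("name", 4)]},
    {0: [("book", 1)], 1: [("title", 2), ("author", 3)],
     3: [("name", 4), ("email", 5)]},
    {0: [("book", 1)], 1: [("title", 2), ("price", 3)]},
]


def label_paths(doc, root=0):
    """Depth-first search returning the set of root-anchored label paths."""
    found = set()

    def dfs(oid, path, on_path):
        for label, child in doc.get(oid, []):
            if child in on_path:  # do not follow cycles back onto the path
                continue
            new_path = path + (label,)
            found.add(new_path)
            dfs(child, new_path, on_path | {child})

    dfs(root, (), {root})
    return found


def frequent_paths(docs, min_support):
    """Select frequent label paths layer by layer (by path length)."""
    support = Counter()
    for doc in docs:
        support.update(label_paths(doc))  # a document counts once per path

    frequent = set()
    max_len = max((len(p) for p in support), default=0)
    for k in range(1, max_len + 1):
        for path in support:
            if len(path) != k:
                continue
            # Prefix pruning (an assumption of this sketch, not the thesis's
            # property): a root-anchored path can only be frequent if its
            # prefix of length k - 1 is frequent.
            if k > 1 and path[:-1] not in frequent:
                continue
            if support[path] >= min_support:
                frequent.add(path)
    return frequent


def longest_paths(frequent):
    """Deletion step: drop every path that is a proper prefix of another."""
    prefixes = {p[:i] for p in frequent for i in range(1, len(p))}
    return sorted(p for p in frequent if p not in prefixes)


if __name__ == "__main__":
    for path in longest_paths(frequent_paths(DOCS, min_support=2)):
        print(".".join(path))
```

With a minimum support of 2 documents, the paths `book.author.email` and `book.price` are dropped as infrequent, and `book` and `book.author` are deleted as prefixes of longer frequent paths. Because the supports here are counted exactly, the prefix check never changes the result; it only makes explicit where a level-wise method could avoid matching a candidate at all.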
Keywords/Search Tags: Semi-structured data, OEM, target model, longest frequent label path