
Research On Cleaning Method For XML Similarity Duplicate Data

Posted on: 2017-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: X X Cao
Full Text: PDF
GTID: 2278330482488705
Subject: Industrial Economics
Abstract/Summary:
With the continuous development of information technology and the diversification of access to network data, a large amount of electronic data has been generated on the Internet, and it plays a growing role in many fields. As a result, the volume of XML data is also increasing quickly. As a semi-structured format, XML is very important in many application areas because of its characteristics and advantages. However, semi-structured data also brings many data quality problems, so the task of cleaning XML data cannot be ignored. Because XML is self-describing, its data formats are flexible and loosely constrained, which leads to repeated XML documents and, in particular, to similarity duplicate data. Such data generates a great deal of redundant information. Many of the effective methods and tools currently used to analyze and manage data support only simple analysis operations on XML data. They generally handle dirty data passively and cannot solve XML data quality problems. Consequently, after the dirty data is cleaned, only a small part of the information in the original XML data is retained. Many articles have studied XML data cleaning, and most of them focus on cleaning similarity duplicate data. Similarity duplicate data has always been a hard problem in data cleaning, and the difficulty lies in identifying the similar duplicates. The accuracy of the information obtained is also very important.
To reduce time complexity and improve the efficiency of data cleaning, this thesis optimizes the algorithm for detecting similarity duplicate XML data, which lays a foundation for future research on data mining. The research focuses on the problem of similarity duplicates in XML data and on how to clean them. First, similarity duplicate XML data is defined. Second, the XML documents are preliminarily classified by path matching. Third, GA-PSO, an optimized variant of the particle swarm optimization (PSO) algorithm, is used to detect the similarity duplicate parts of the XML data, which are then cleaned according to the cleaning rules for dirty data. Unlike the general particle swarm algorithm, the optimized GA-PSO algorithm does not update particle positions by tracking their extreme values. To detect the similarity duplicate parts of XML data effectively, it introduces two operations from the genetic algorithm, crossover and mutation, into PSO: each particle is crossed with its individual best and with the group best, and then mutates itself. Finally, simulation experiments show that GA-PSO detects XML similarity duplicate data with good iterative convergence and higher regional stability than standard PSO. GA-PSO also improves considerably on running time, which reduces the workload.
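The two stages described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the path-matching rule, the particle encoding, the fitness function, or any parameter values, so everything below is an assumption made for illustration. Here `group_by_paths` treats two documents as the same class when their sets of root-to-element tag paths are identical, and `ga_pso` minimizes an arbitrary caller-supplied fitness function over real-valued vectors, replacing the usual velocity update with uniform crossover against the individual best and the group best, followed by a small Gaussian mutation. All function and parameter names are hypothetical.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def path_signature(xml_text):
    """Return the set of root-to-element tag paths found in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return frozenset(paths)


def group_by_paths(docs):
    """Assumed pre-classification rule: identical path sets share one group."""
    groups = defaultdict(list)
    for index, text in enumerate(docs):
        groups[path_signature(text)].append(index)
    return list(groups.values())


def crossover(a, b, rng):
    """Uniform crossover: each coordinate is taken from a or b with equal chance."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(vec, rng, rate=0.1, scale=0.1):
    """Perturb each coordinate with probability `rate` by Gaussian noise."""
    return [v + rng.gauss(0.0, scale) if rng.random() < rate else v for v in vec]


def ga_pso(fitness, dim, rng, swarm=20, iters=100):
    """Minimise `fitness` with crossover/mutation updates instead of velocities.

    Returns the best vector found and the best fitness after each iteration.
    """
    particles = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(swarm)]
    pbest = [p[:] for p in particles]          # individual bests
    gbest = min(pbest, key=fitness)[:]         # group best
    history = [fitness(gbest)]
    for _ in range(iters):
        for i, particle in enumerate(particles):
            child = crossover(particle, pbest[i], rng)   # cross with individual best
            child = crossover(child, gbest, rng)         # cross with group best
            child = mutate(child, rng)                   # self-mutation
            particles[i] = child
            if fitness(child) < fitness(pbest[i]):
                pbest[i] = child[:]
                if fitness(child) < fitness(gbest):
                    gbest = child[:]
        history.append(fitness(gbest))
    return gbest, history
```

In a duplicate-detection setting, the fitness function would presumably score how well a candidate solution separates similar from dissimilar records within one path group; the abstract does not describe that scoring, so the sketch leaves it to the caller.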
Keywords/Search Tags:Data quality, Data cleansing, XML similarity duplicate data, GA-PSO algorithm