Research On XML Data Management For Retrieval And Classification

Posted on: 2016-03-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H T Wu
Full Text: PDF
GTID: 1318330512971799
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, web services technology has gradually matured, and a large amount of semi-structured data represented in XML now exists on the Internet. As a data storage and exchange format, XML plays an important role: it has become the de facto standard for data exchange and is widely used in e-commerce, e-government, finance, and other fields. However, as the demands of network applications grow, massive XML data raises several open problems: how to retrieve information from it, how to store and manage semi-structured data efficiently, how to extract potentially valuable information from such complex data, and how to classify or cluster it. Data management for retrieval and classification has therefore become an active research topic.

XML (Extensible Markup Language) was released by the W3C in 1998. Because of its extensibility, self-describing nature, and platform independence, it is widely used on the Internet, and this wide use makes efficient XML data management a pressing need. Research on XML has grown in recent years and covers many aspects: from basic XML encoding to XML keyword query, from XML data storage to XML document classification, from distributed processing to access control, from physical storage to secure transmission, and from efficient query to XML clustering. From the perspective of data retrieval and classification in data management, this dissertation focuses on XML encoding, search, feature representation, and similarity detection for classification. The main results and contributions are as follows:

(1) Based on the particular structure of XML documents, an XML document labeling scheme with a ring-shaped structure is proposed. It uses the ring-shaped structure to organize sibling nodes and supports dynamic node updates. To evaluate its performance, a static analysis and a set of test documents are used to compare it with other schemes in terms of storage size and dynamic updates. The experimental results show that the ring-shaped encoding scheme effectively addresses the high cost and low efficiency of existing labeling schemes for XML documents.

(2) When an XML document that contains sensitive information is transmitted or exchanged over the Internet, users must be limited to restricted queries. For this case, a secure keyword search algorithm is designed. Starting from an instance information tree annotated with access control policies, the method first extracts the policies of the main information, then applies them back to the instance information tree and stores the policies of special nodes. The resulting compacted policies provide the basis for secure and efficient XML keyword queries. Extensible Dewey coding is also adopted to simplify query processing. The experimental results show that, when there are few keywords with low frequency, the compressed algorithm saves about 66% of the search time in the best case and about 10% in the worst case compared with the uncompressed algorithm.

(3) To address efficient and accurate classification of the many kinds of XML documents in massive online semi-structured data, a new feature representation method based on PCA is proposed. It builds on the edge-set representation of XML and on full-path features. To reduce the dimensionality of the feature space, PCA dimension reduction is applied, which lowers the dimension from several hundred to ten. On top of this feature representation and PCA reduction, KNN is used to classify XML documents automatically.

(4) To classify XML documents with supervised learning, an XML similarity detection algorithm based on matrix storage is proposed. The method first extracts the main structure of an XML document and then represents that structure as a matrix. In the matrix, the storage position encodes the document's structure and the stored content encodes its semantics. XML similarity detection thus becomes matrix similarity detection, and the method takes both the structure and the semantics of XML documents into account. To demonstrate the effectiveness of the algorithm, a nearest neighbor classifier is used to classify XML documents automatically, and its classification accuracy exceeds 98%.
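
As a rough illustration of structure-based XML classification, the sketch below counts root-to-node tag paths as features and assigns a label with a 1-nearest-neighbor rule. It is not the dissertation's algorithm: the function names (`path_features`, `cosine`, `classify`), the use of cosine similarity, and the toy documents are all assumptions made for this example, and both the PCA reduction step and the matrix representation are omitted.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_features(xml_text):
    """Count every root-to-node tag path in one XML document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        counts[path] += 1
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse path-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(xml_text, training):
    """1-NN: return the label of the most similar (document, label) training pair."""
    query = path_features(xml_text)
    scored = ((label, cosine(query, path_features(doc))) for doc, label in training)
    best_label, _ = max(scored, key=lambda pair: pair[1])
    return best_label


# Toy training set, invented for illustration only.
TRAINING = [
    ("<book><title>A</title><author>B</author></book>", "book"),
    ("<order><item><sku>1</sku></item><item><sku>2</sku></item></order>", "order"),
]

if __name__ == "__main__":
    doc = "<book><title>X</title><author>Y</author><author>Z</author></book>"
    print(classify(doc, TRAINING))
```

In a setting closer to the one described above, the path counts would first be turned into dense vectors over a shared path vocabulary and reduced with PCA before a k-nearest-neighbor vote with k greater than 1.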
Keywords/Search Tags: Mass Data, XML Encoding, Keyword Search, Similarity Calculation, XML Documents Classification