Research on Efficient Provenance Storage

With the development of information technology, people are concerned not only with data itself but also with its origin and evolution. This historical information about data is known as provenance. Provenance is widely used in scientific research, where data quality is extremely important. Many information systems produce and collect provenance, in fields including astrophysics, chemistry, biology, and marine meteorology. Applications of provenance in data reconstruction, debugging and tracing, security, and search are also beginning to appear. In many provenance systems, however, the provenance occupies far more space than the data itself. This consumes excessive resources and greatly affects the availability and efficiency of the provenance system.

To reduce the space occupied by provenance without affecting its integrity, Chapman proposed the factorization and inheritance (FAI) algorithms. FAI only extracts the common information from provenance nodes and optimizes it. The web dictionary encoding method proposed in this paper not only extracts and optimizes common information but also optimizes the identity information of the data itself. At the same time, it mines the internal similarity of provenance nodes: a web algorithm is used to optimize the encoding of provenance ancestors, further reducing the storage cost of provenance while preserving the performance of provenance queries. This method works at the micro level.

At the macro level, the quantity of provenance increases over time, so that storage space and query time grow without bound. To address this problem, this paper takes the PASS system as an example: it partitions the provenance information, builds indexes, and compresses the partitioned provenance files, and then exploits the locality of provenance data to improve the storage and search mechanisms of PASS.

The experimental results show that the web dictionary encoding algorithm outperforms the FAI algorithms in storage space occupancy and in query time for both identity and ancestor information. For PASS, the optimization that partitions the database, builds indexes, and compresses the partitioned database files outperforms the original method in both space occupancy and query time.
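
The idea of extracting information shared by many provenance nodes can be illustrated with a generic dictionary encoding. The Python sketch below is neither Chapman's FAI algorithms nor the web dictionary encoding method of this paper. The record layout, the attribute names, and the functions `encode` and `decode` are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical provenance node: a flat mapping of attribute name to value.
Record = Dict[str, str]
Pair = Tuple[str, str]


def encode(records: List[Record]) -> Tuple[List[Pair], List[List[int]]]:
    """Replace every (attribute, value) pair with a small integer code.

    Returns the shared table (code -> pair) and the encoded records.
    A pair that occurs in many records is stored only once in the table.
    """
    codes: Dict[Pair, int] = {}
    table: List[Pair] = []
    encoded: List[List[int]] = []
    for record in records:
        row: List[int] = []
        # Sort the items so that the encoding is deterministic.
        for pair in sorted(record.items()):
            if pair not in codes:
                codes[pair] = len(table)
                table.append(pair)
            row.append(codes[pair])
        encoded.append(row)
    return table, encoded


def decode(table: List[Pair], encoded: List[List[int]]) -> List[Record]:
    """Rebuild the original records from the table and the code lists."""
    return [dict(table[code] for code in row) for row in encoded]


if __name__ == "__main__":
    # Illustrative data only: three nodes that share two of three attributes.
    records = [
        {"tool": "gcc", "user": "alice", "input": "a.c"},
        {"tool": "gcc", "user": "alice", "input": "b.c"},
        {"tool": "gcc", "user": "alice", "input": "c.c"},
    ]
    table, encoded = encode(records)
    print(len(table), encoded)
    assert decode(table, encoded) == records  # the encoding is lossless
```

In this toy input, the nine attribute-value pairs of the three records collapse into five table entries, and decoding restores the records exactly, which mirrors the requirement that space savings must not affect provenance integrity.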
