| Oil and gas exploration and development data is characterized by a wide business scope, large data volume, varied data types, deep historical accumulation, and complex relationships among data items. Data mining technology can uncover the internal connections in such data, but only if the data sources are authentic and reliable. Data lineage analysis helps data analysts quickly locate the source and processing history of problematic data, which is of great significance for data quality control and credibility management in oil companies. This thesis reviews the domestic and international research status of lineage analysis theory, lineage analysis models, lineage analysis methods, and lineage visualization, and, in view of the problems that remain open and the practical needs of oil companies, focuses on the following topics. 1. Research on pattern-level data lineage analysis methods. This thesis uses a data lineage analysis method based on the SQL abstract syntax tree to address the difficulty of lineage analysis caused by the complex and diverse forms of SQL statements. Firstly, the missing definition of the SQL abstract syntax tree is supplied. Secondly, a data lineage analysis algorithm is provided, a physical structure file of data lineage relationships is constructed, and lineage subgraphs are formed. Thirdly, data lineage fusion methods are studied to form full-link data lineage. Finally, the lineage relationship diagram is displayed in the experimental analysis stage. 2. Research on tuple-level data lineage analysis methods. This thesis proposes ProvEmbX, an improved method for tuple-level data lineage analysis based on word embedding, to solve the problem of large storage overhead in the tagging method. Firstly, a tuple vectorization encoding mechanism is studied, and tuple lineage relationships are identified according to tuple vector similarity. Secondly, an optimization algorithm based on field
importance is proposed to improve the accuracy of lineage relationship analysis. Thirdly, an approximate nearest neighbor search algorithm is introduced and a tuple-filtering optimization mechanism is proposed to reduce the time complexity of lineage relationship analysis. Finally, comparative experiments are carried out to verify the improvement achieved by the proposed method, and a directed acyclic graph is used to display tuple-level data lineage. 3. Research on data lineage relationship visualization methods. Based on the PROV model published by the W3C, this thesis investigates a hierarchical automatic layout algorithm for lineage relationships and a data lineage fusion method to help present lineage relationships clearly. Firstly, a PROV model is proposed and the data lineage relationship graph is defined. Secondly, layout constraints suited to the data lineage domain are defined as the basis for studying the automatic layout algorithm. Finally, an experiment based on the PROV model is conducted to compare layout results. 4. Design and implementation of a data lineage analysis system. Based on the lineage analysis algorithm, the lineage fusion algorithm, and the hierarchical automatic layout algorithm, this thesis develops a data lineage analysis system that provides data lineage analysis, data lineage management, and data lineage visualization services. The system helps track data flow paths within the oil company, facilitates the maintenance of relationships between data, and is important for corporate data governance. |