A Study Of Data Sanitization For Enhanced Sensitive Patterns

Posted on: 2024-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2568307064996719
Subject: Computer technology
Abstract/Summary:
Providing services on the Internet requires large amounts of data to be released publicly. Releasing such data may expose sensitive patterns that model confidential knowledge, so protecting private information requires removing those patterns from the data before release. Data and sensitive patterns are usually represented as strings, and the process of eliminating sensitive patterns from data strings is called string cleaning. However, existing algorithms such as TPM describe sensitive patterns as plain strings, which lack the descriptive power to model complex sensitive information efficiently and cannot meet the demands of sensitive data protection. To solve this problem, this paper proposes a more flexible definition of sensitive patterns that can describe a larger range of sensitive information in more situations, and presents the corresponding sensitive pattern cleaning algorithms.

First, this paper proposes the SSWW (Sensitive Pattern Sanitization With Wildcard) problem, which introduces the wildcard $ into the sensitive pattern description, where $ denotes an arbitrary character. With wildcards, a single pattern can describe a whole collection of sensitive patterns. For this kind of pattern, we propose a data sanitization algorithm, SSWW-ALGO, which effectively removes sensitive patterns without affecting the order and frequency of non-sensitive patterns. If the length of the original data is n and the length of the sensitive pattern is k, the time complexity of SSWW-ALGO is O(kn) and the space complexity is O(n). The experimental data show that k is the factor that affects the running time the most, and that the running time increases slightly as k increases.

Second, this paper proposes the PHDC (Pattern Hide with Don’t Cares) problem, which extends sensitive patterns to a series of sequentially occurring strings, called VLDC (Variable Length Don’t Cares) patterns. Previous studies on VLDC pattern matching only detect whether a VLDC pattern exists in the string, whereas this paper detects all VLDC patterns that exist in the string. We design two data cleaning algorithms for VLDC patterns, hideSequenceOne and hideSequenceTwo. If the length of the source data is n and the length of the VLDC pattern is k, the time complexity of hideSequenceOne is O(kn^2) and the space complexity is O(n). The hideSequenceTwo algorithm focuses on protecting data integrity and availability during sanitization. The experimental data show that hideSequenceOne is 174.8% faster than hideSequenceTwo in running time, but hideSequenceTwo is better than hideSequenceOne at guaranteeing data integrity and availability.
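
To make the two pattern classes concrete, the following Python sketch illustrates only the matching side of the problem. It is not SSWW-ALGO, hideSequenceOne, or hideSequenceTwo, none of which are specified in the abstract, and it performs no sanitization. The function names, the representation of a VLDC pattern as a list of substrings, and the sample strings are all assumptions made for illustration. The first function reports every position where a pattern containing $ matches, using a plain scan that costs O(kn); the second performs the existence check that the abstract attributes to earlier VLDC work, using greedy leftmost matching.

```python
WILDCARD = "$"  # the abstract's wildcard symbol: stands for any single character


def find_wildcard_occurrences(text: str, pattern: str) -> list[int]:
    """Return every start index where `pattern` matches a length-k window of `text`.

    Hypothetical helper: a direct O(k*n) scan, not the thesis's algorithm.
    """
    k = len(pattern)
    if k == 0:
        return []
    hits = []
    for start in range(len(text) - k + 1):
        # A window matches if each position is a wildcard or an equal character.
        if all(p == WILDCARD or p == text[start + j] for j, p in enumerate(pattern)):
            hits.append(start)
    return hits


def contains_vldc(text: str, parts: list[str]) -> bool:
    """Check whether the substrings in `parts` occur in order, without overlap,
    separated by gaps of any length (including zero).

    Hypothetical helper: greedy leftmost matching, which decides existence only.
    """
    pos = 0
    for part in parts:
        idx = text.find(part, pos)
        if idx == -1:
            return False
        pos = idx + len(part)  # the next part must start after this one ends
    return True


if __name__ == "__main__":
    print(find_wildcard_occurrences("abcabdab", "ab$"))
    print(contains_vldc("xxabyycdzz", ["ab", "cd"]))
    print(contains_vldc("cdab", ["ab", "cd"]))
```

The existence check stops at the first way of embedding the parts, which is why it cannot by itself enumerate all VLDC occurrences, the task the abstract identifies as its contribution.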
Keywords/Search Tags: Data sanitization, Pattern matching, Sensitive pattern, VLDC pattern