Outlier Detection Of JSON Document For NoSQL Database

Posted on: 2022-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: L C Liu
Full Text: PDF
GTID: 2518306551970669
Subject: Master of Engineering
Abstract/Summary:
With the rapid development of information technology, data-driven database development and innovation have had a significant impact in many fields, such as materials science and biomedicine. However, traditional relational databases struggle to handle growing data volumes and heterogeneous data. In materials science, for example, the diversity of disciplines means there is no unified way of expressing and recording data across different materials, the purposes and requirements for material data also differ, and data storage is becoming more complicated. The schemaless storage and high scalability of NoSQL databases can address this problem. JSON, a commonly used data storage format for NoSQL databases, is popular because of its simplicity and flexibility. Since NoSQL databases lack schema information, JSON documents need to be analyzed and verified before they are stored in the database. However, existing research methods are still limited in both structural and semantic analysis. Take reference citations as an example: the citation format of a paper may differ in structure under different standards, but semantic analysis shows that there are no significant differences. This thesis addresses the problems of structural outlier detection and semantic outlier disambiguation for JSON documents. The contributions are summarized as follows:

(1) The JSON document outlier detection model doctor JSON is proposed. It consists of 4 modules: a JSON schema extraction module, a JSON schema verification module, a JSON document detection module, and a classification module. The JSON document detection module is the core of the model and is used to detect structural outliers and eliminate semantic outliers in JSON documents.

(2) For the problem of structural outlier detection in JSON documents, deout JSON and deout JSON+ are proposed: a rule-based structural outlier detection algorithm and a keyword deletion-based structural outlier detection algorithm, respectively. First, the structural outlier detection problem for JSON documents is formally defined, and 3 types of structural outlier are identified: the kv.key redundancy outlier, the kv.key missing outlier, and the kv.value type outlier. The rule-based algorithm deout JSON is designed to accurately identify these three structural anomalies. It consists of three main parts: keyword candidate set generation, keyword candidate comparison, and abnormal data generation. In the keyword candidate set generation stage, an optimization strategy sorts the keywords in the candidate set, which effectively reduces the number of keyword comparisons generated by the enumeration tree. In the keyword candidate comparison stage, an enumeration tree is used to generate keyword comparisons; based on the properties of JSON documents, a further optimization strategy ignores cross-level key-value pairs to improve the efficiency of the algorithm, and outlier detection is performed at different granularities to accurately identify and locate anomalous data. The abnormal data generation stage provides two forms of outlier detection results, which makes the results easier to compare. To eliminate the redundant comparisons generated in the keyword candidate comparison stage, the keyword deletion-based algorithm deout JSON+ is designed, which improves the efficiency of structural outlier detection.

(3) For the problem of semantic outlier disambiguation in JSON documents, semantic outliers in JSON documents are formally defined, and disema JSON, a semantic outlier elimination algorithm based on keyword similarity, is designed. It analyzes the semantics of JSON documents using the similarity of word vectors and eliminates semantic outliers by replacing the outlier items with the most similar keywords. The disema JSON algorithm consists of 3 main parts: keyword vectorization, keyword matching dictionary generation, and keyword replacement. The keyword vectorization stage uses word embeddings: the extracted keyword set is fed into a word embedding model, which outputs the word vectors. The keyword matching dictionary generation stage uses a hash map to store semantic outlier items. Based on the observed features of semantic outliers, a detection strategy is proposed: the disema JSON algorithm is executed on the redundancy outliers in a JSON document to eliminate its semantic outliers. In the keyword replacement stage, keywords are matched against the semantically abnormal keywords in the dictionary, and Top-K ranking is used to select the most similar keywords in order to eliminate semantic outliers.

(4) Detailed experiments on multiple real and synthetic datasets are conducted to verify the effectiveness and efficiency of the deout JSON and deout JSON+ algorithms, to verify the effectiveness of the disema JSON algorithm, and to analyze the detection results.
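
The three structural outlier types named in contribution (2) can be illustrated with a minimal Python sketch. This is not the deout JSON algorithm itself: it uses no enumeration tree or keyword candidate sets, and simply compares a document against a reference schema by recursion. The schema format (a dict mapping each key to an expected Python type or a nested dict), the function name `find_structure_outliers`, and the outlier labels are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical schema format (an assumption, not from the thesis): each key
# maps to an expected Python type, or to a nested dict for a nested object.
Schema = dict[str, Any]


def find_structure_outliers(doc: dict, schema: Schema, path: str = "") -> list[tuple[str, str]]:
    """Return (outlier kind, dotted key path) pairs for one JSON document."""
    outliers: list[tuple[str, str]] = []

    # Key redundancy: keys in the document that the schema does not define.
    for key in sorted(doc.keys() - schema.keys()):
        outliers.append(("key_redundancy", path + key))

    for key, expected in schema.items():
        full_path = path + key
        if key not in doc:
            # Key missing: the schema defines a key the document lacks.
            outliers.append(("key_missing", full_path))
        elif isinstance(expected, dict):
            if isinstance(doc[key], dict):
                # Nested object: check the next level against the sub-schema.
                outliers.extend(find_structure_outliers(doc[key], expected, full_path + "."))
            else:
                outliers.append(("value_type", full_path))
        elif not isinstance(doc[key], expected):
            # Value type: the key exists but its value has the wrong type.
            outliers.append(("value_type", full_path))
    return outliers


schema = {"title": str, "year": int, "author": {"name": str}}
doc = {"title": "A paper", "year": "2022", "author": {"first": "Ada"}, "pages": 10}
outliers = find_structure_outliers(doc, schema)
```

With these made-up inputs, `pages` and `author.first` are reported as redundant keys, `author.name` as a missing key, and `year` as a value type outlier. Reporting dotted paths is one simple way to locate an anomaly at a specific nesting level.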
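
The keyword matching and replacement stages described in contribution (3) can be sketched in the same spirit. The sketch below assumes cosine similarity as the similarity measure, uses hand-written toy vectors in place of a trained word embedding model, and applies a similarity threshold of 0.9; the abstract specifies none of these, and the function names are hypothetical.

```python
import math

# Toy embeddings with made-up values, for illustration only. A real pipeline
# would obtain these vectors from a trained word embedding model.
TOY_VECTORS = {
    "author": [0.9, 0.1, 0.0],
    "writer": [0.85, 0.2, 0.05],
    "year": [0.0, 0.9, 0.1],
    "date": [0.1, 0.85, 0.2],
    "title": [0.1, 0.0, 0.95],
    "pages": [0.5, 0.5, 0.5],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_matches(keyword: str, schema_keys: list[str], vectors: dict, k: int = 1) -> list[str]:
    """Return the k schema keys most similar to the keyword, best first."""
    scored = [(cosine(vectors[keyword], vectors[s]), s) for s in schema_keys]
    scored.sort(reverse=True)
    return [s for _, s in scored[:k]]


def build_matching_dict(redundant_keys, schema_keys, vectors, threshold=0.9) -> dict:
    """Hash map from a redundant key to the schema key it most likely means.

    The threshold is an assumed cutoff; keys below it stay unmatched.
    """
    matching = {}
    for key in redundant_keys:
        if key not in vectors or not schema_keys:
            continue
        best = top_k_matches(key, schema_keys, vectors, k=1)[0]
        if cosine(vectors[key], vectors[best]) >= threshold:
            matching[key] = best
    return matching


def replace_keys(doc: dict, matching: dict) -> dict:
    """Rename matched keys at the top level; leave all other keys unchanged."""
    return {matching.get(k, k): v for k, v in doc.items()}


matching = build_matching_dict(["writer", "date", "pages"], ["author", "year", "title"], TOY_VECTORS)
cleaned = replace_keys({"writer": "Ada", "date": 2022, "title": "A paper"}, matching)
```

Here `writer` and `date` are treated as semantic variants of `author` and `year` and renamed, while `pages` falls below the assumed threshold and remains an ordinary redundancy outlier.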
Keywords/Search Tags: NoSQL database, JSON document, Structure outlier detection, Semantic outlier disambiguation