
Design And Implementation Of Sensitive Information Detection Algorithm Based On Deep Learning

Posted on: 2022-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Deng
Full Text: PDF
GTID: 2518306338968219
Subject: Computer technology
Abstract/Summary:
GitHub has become the most popular open-source code hosting platform, and more and more developers and companies upload their projects to it. Through negligence or insufficient security awareness, developers sometimes push repositories containing sensitive information to public areas of GitHub, leading to leaks of sensitive information and many security hazards. Technical means that can effectively identify sensitive information in source code are therefore particularly important. In response, this thesis designs and implements a sensitive information detection system based on Elasticsearch full-text search technology. While maintaining search accuracy and query performance, the system can retrieve documents containing sensitive information from massive collections of source files by keyword.

This thesis studies in depth the Chinese word segmentation and ranking algorithms commonly used in search engine technology. Three families of Chinese word segmentation algorithms are covered: string matching, word-frequency statistics, and semantic analysis. The source text is segmented with the IK tokenizer, which is implemented on the basis of string matching. The ranking algorithms include TF-IDF, PageRank, and BM25; the principle, advantages, and disadvantages of each are analyzed, and an improved BM25 algorithm is applied to rank the search results.

Considering that the data volume is large and will continue to grow, this thesis uses the HDFS distributed file system to store the source code data. An HDFS cluster is easy to scale: storage capacity can be expanded by adding nodes, and data is unlikely to be lost. Its drawback is that HDFS is poorly suited to storing large numbers of small files; this thesis addresses the problem by merging all the small files of a project into a single large file.

An Elasticsearch cluster is deployed as the search engine to provide full-text search. Source code files are read from the HDFS cluster and uploaded to the Elasticsearch cluster; along the way, a series of optimizations is applied to improve the cluster's indexing and query performance. Following requirements analysis, outline design, and detailed design, the sensitive information detection system was implemented with Spring Boot, Thymeleaf, MyBatis Plus, Layui, and other technologies. Finally, the system was tested, and the test results met expectations.
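To make the ranking discussion concrete, the following is a minimal sketch of the classic BM25 scoring formula that the thesis takes as its starting point before improving it. The function name and the toy documents are illustrative, not taken from the thesis; `k1` and `b` are the standard BM25 free parameters.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with classic BM25.

    docs: list of token lists (e.g. output of a tokenizer such as IK);
    query_terms: list of query tokens. Returns one score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            # Smoothed inverse document frequency.
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with length normalization.
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

A document that repeats a query term scores higher than one that mentions it once, and documents without the term score zero, which is the behavior a result-ranking stage relies on.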
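The small-file strategy can also be sketched. The thesis merges all of a project's small files into one large file before storing it in HDFS; the sketch below shows the general idea in plain Python (pack the payloads into one blob and keep an offset index for retrieval). The function names and the index layout are assumptions for illustration, not the thesis's actual on-disk format.

```python
import io

def merge_small_files(files):
    """Pack a {name: bytes} mapping into one blob plus an offset index.

    Mimics merging a project's many small files into a single large
    file, sidestepping HDFS's per-file metadata overhead.
    """
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))  # (offset, length)
        blob.write(data)
    return blob.getvalue(), index

def read_merged(blob, index, name):
    """Recover one original file from the merged blob via the index."""
    offset, length = index[name]
    return blob[offset:offset + length]
```

In a real deployment only the merged blob would live in HDFS, while the lightweight index could be kept alongside it or in a database, so individual source files remain addressable.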
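Finally, a hypothetical Elasticsearch index definition illustrates how the IK tokenizer and some common indexing-performance settings fit together. The field names, shard counts, and refresh interval here are assumptions for illustration; `ik_max_word` is the analyzer name provided by the IK analysis plugin the thesis uses.

```python
# Illustrative index body for source-code documents; the actual
# schema and tuning values in the thesis may differ.
index_body = {
    "settings": {
        "number_of_shards": 3,        # spread the index across the cluster
        "number_of_replicas": 1,
        "refresh_interval": "30s",    # less frequent refresh speeds up bulk indexing
    },
    "mappings": {
        "properties": {
            "repo": {"type": "keyword"},   # exact-match fields stay untokenized
            "path": {"type": "keyword"},
            # Full-text field segmented by the IK tokenizer (string matching).
            "content": {"type": "text", "analyzer": "ik_max_word"},
        }
    },
}
# With the official Python client this would be created as, e.g.:
# es.indices.create(index="source-code", body=index_body)
```

Keyword fields keep repository names and paths filterable without segmentation, while the `content` field is the one searched with keywords for sensitive information.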
Keywords/Search Tags: sensitive information, Elasticsearch, HDFS, Chinese word segmentation, ranking of search results