| We are now in the era of big data. Protecting the security of big data, and of sensitive data in particular, deserves more attention. Text is the main form of data in big data. Most sensitive data detection tools on the market rely solely on sensitive-word matching to search text content for sensitive words, which leads to a very high misjudgment rate and requires substantial manual effort to further screen the suspected sensitive information. Therefore, based on the idea of "data available but invisible", this paper proposes an intelligent auxiliary identification mechanism for sensitive data: on the premise that the original data never leaves the data holder, machine learning algorithms are used to detect sensitive information in the original data, improving the accuracy of sensitive information judgment. The main contributions of this paper are as follows. (1) A text readability judgment model based on the AdaBoost algorithm is proposed. The model judges text readability by constructing two base classifiers that use different text feature extraction methods, ensuring that a text is readable before it is passed to the subsequent sensitive text judgment. The experimental results show that the accuracy of AdaBoost-based readability judgment exceeds 80% and is higher than that of either base classifier 1 or base classifier 2. (2) A sensitive text judgment model based on context semantics is proposed. The model detects sensitive data in text that has been judged readable. Specifically, this paper improves the existing matching algorithm by adding handling of sensitive words containing Pinyin, and uses a word2vec model to associate context semantics in order to judge whether a text is sensitive. Using the text classification corpus provided by Fudan University, this paper experimentally compares the matching algorithm with sensitive text judgment based on context semantics. The experimental results show that the accuracy of sensitive text judgment based on context semantics reaches 87%, effectively reducing the misjudgments that arise when a matching algorithm matches sensitive words by rule alone without considering context semantics. (3) Building on the research and analysis of the text readability judgment model and the sensitive text judgment model, and focusing on text reading, text readability detection, and data sensitivity detection, this paper completes the requirements analysis, overall design, implementation of the main functions, and testing of a prototype intelligent auxiliary identification tool for sensitive data. |
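
The idea behind contribution (2) can be illustrated with a minimal sketch. It is not the paper's implementation: the lexicon `PINYIN_VARIANTS`, the three-dimensional `TOY_VECTORS` (standing in for a trained word2vec model), the context window, and the similarity threshold are all hypothetical values chosen for illustration, and the input is assumed to be already segmented into tokens. A matched sensitive word is flagged only when the surrounding words are, on average, semantically close to it.

```python
import math

# Hypothetical lexicon: canonical sensitive word -> Pinyin spellings that should also match.
PINYIN_VARIANTS = {"密码": {"mima"}, "账号": {"zhanghao"}}

# Toy 3-dimensional embeddings standing in for a trained word2vec model (assumed values).
TOY_VECTORS = {
    "密码": (0.9, 0.1, 0.0),
    "账号": (0.85, 0.15, 0.05),
    "登录": (0.8, 0.2, 0.1),
    "银行": (0.7, 0.3, 0.0),
    "电影": (0.0, 0.2, 0.9),
    "小说": (0.1, 0.1, 0.9),
}


def normalize(token):
    """Map a Pinyin spelling to its canonical sensitive word; return other tokens unchanged."""
    lowered = token.lower()
    for word, variants in PINYIN_VARIANTS.items():
        if lowered in variants:
            return word
    return token


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_sensitive(tokens, window=2, threshold=0.5):
    """Flag a token list only if a matched sensitive word sits in a semantically related context."""
    tokens = [normalize(t) for t in tokens]
    for i, token in enumerate(tokens):
        target = TOY_VECTORS.get(token)
        if token not in PINYIN_VARIANTS or target is None:
            continue  # not a sensitive word, or no embedding available for it
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        sims = [cosine(target, TOY_VECTORS[c]) for c in context if c in TOY_VECTORS]
        # With no usable context this sketch does not flag; a real system might fall back to matching.
        if sims and sum(sims) / len(sims) >= threshold:
            return True
    return False
```

With these toy values, `is_sensitive(["银行", "登录", "mima"])` flags the Pinyin spelling in a banking and login context, while `is_sensitive(["电影", "达芬奇", "密码", "小说"])` does not flag the same word when it appears in a film title, which is the kind of misjudgment that rule-only matching produces.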