A privacy policy is a text in which an organization or business declares how it collects, uses, and shares users' personal information. With the growing emphasis on data security and personal information protection over the past two years, app privacy compliance has become a hot topic. Domestic regulators currently evaluate app privacy policies manually, which is time-consuming and labor-intensive. Using artificial intelligence to drive the analysis of privacy policies has therefore become a new and urgent research topic for app compliance assessment. Within this topic, named entity recognition (NER) on privacy policies is an important basis for further analysis of sentence structure, entity relations, and so on. Because little literature has been reported on this problem, this article takes Chinese privacy policies as its research object and conducts exploratory research from the perspective of named entity recognition. First, starting from the task requirements, this article develops a data set for Chinese privacy policy named entity recognition. Through a literature review, it summarizes the current state of research on privacy policy texts and the problems of privacy policies across industries in China, such as poor readability and excessive length, and then proposes constructing the required data set by combining manual labeling with machine learning algorithms. The main contributions are as follows. First, we obtained a raw corpus from the Huawei application market and designed labeling specifications for Chinese privacy policy entities. Then, drawing on the regulators' evaluation indicators for Chinese privacy policies, we determined the entity types to be labeled after confirmation by experts, and annotated the entities with the "Chinese Privacy Policy Text Annotation Tool", which was built through secondary development of the "YEDDA" tool. In order to
improve the speed and convenience of sentence classification, this article proposes a keyword pre-screening method that classifies the processed corpus text based on the relationship between keywords and sentence topics. Experiments show that stacking, an ensemble learning method, achieves better classification performance than a single classifier, though at a relatively high cost. This article then analyzes the latent relationship between keywords and sentence topics: it constructs a sparse matrix for Chinese privacy policies, applies singular value decomposition, and measures distance with cosine similarity to perform latent semantic analysis. Experiments show that the keywords identified by this latent semantic analysis of sentences largely coincide with those selected by keyword pre-screening, which supports the effectiveness of the keyword selection. For Chinese privacy policy named entity recognition, this paper uses a conditional random field (CRF) model based on statistical machine learning and proposes a PRI-BI-LSTM-CRF neural network method for the Chinese privacy policy NER task. The CRF model serves as the baseline method; for it, the data must first be converted with tools into the CRF training and test format. Building on the CRF, this paper designs experiments from the perspective of domain characteristics to compare how different text input granularities and window sizes affect the recognition results. Among the neural network-based methods, experiments show that the PRI-BI-LSTM-CRF model reaches an average F1 value of 79.55% across the six entity types. At the same time, to address the shortage of annotated Chinese privacy policy data, which limits recognition accuracy, this paper draws on transfer learning
to propose transferring privacy-specific word embeddings through pre-trained parameters, building a Trans-PRI-BI-LSTM-CRF neural network intended to improve accuracy on the Chinese privacy policy NER task. In summary, this paper designs four sets of experiments on the influence of feature factors and compares three recognition methods under word-granularity labeling (CRF, PRI-BI-LSTM-CRF, and Trans-PRI-BI-LSTM-CRF) in order to analyze the factors that affect the accuracy of Chinese privacy policy named entity recognition and how effectively the different methods solve the task. According to the experimental data, the Trans-PRI-BI-LSTM-CRF method under word-granularity labeling reaches an average F1 value of 79.92% on Chinese privacy policy named entity recognition, which improves recognition accuracy and indicates that the method is feasible.
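
The F1 values reported above can be made concrete with a small sketch of how such a score is commonly computed for NER. This is an illustration under stated assumptions, not the evaluation procedure of this paper: it assumes BIO-style tags, counts a predicted entity as correct only when its boundaries and type exactly match a gold entity, and uses the hypothetical entity types DATA and ORG, since the six entity types are not named here. The function names are likewise hypothetical.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tag scheme).

    An I- tag that does not continue an open entity of the same type is ignored.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    # The trailing "O" is a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 between gold and predicted tag sequences."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the model finds the DATA entity but misses the ORG entity.
gold_tags = ["B-DATA", "I-DATA", "O", "B-ORG", "O"]
pred_tags = ["B-DATA", "I-DATA", "O", "O", "O"]
score = entity_f1(gold_tags, pred_tags)
```

In this example precision is 1.0 and recall is 0.5, so the F1 value is 2/3. An average over entity types, as reported above, could be obtained by computing this score separately for each type and averaging; whether the paper averages that way is not stated here.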