| In recent years, China's mobile network market has been growing, and more and more harmful applications are hidden within it. To evade censorship, harmful applications that provide illegal functions usually present ordinary, innocuous description text, so they cannot be identified as harmful from the description alone. Their user reviews, however, often contain code words (hidden jargon). The text associated with applications in China's mobile application market therefore needs careful study. Because code words in mobile applications cannot currently be discovered and understood in time, the analysis of application text corpora by network security personnel is seriously hindered. We find that code words in mobile applications consist mainly of known words already included in dictionaries; in certain contexts these words carry special meanings that differ from their ordinary ones. In natural language processing, word sense disambiguation (WSD) can determine the correct meaning of a polysemous word in a given context. At present, however, WSD can only disambiguate among senses that are already recorded, not unknown senses, so it cannot be used directly to discover Chinese code words. To solve these problems, this paper proposes a method for discovering Chinese code words in mobile applications and designs a system for discovering both the code words and harmful applications. The method first preprocesses the application text in several ways according to the characteristics of Chinese code words in mobile applications. Then, based on an improved word2vec model, it determines whether a word in the application text is used in its code-word sense, thereby discovering code words. A GloVe model is then trained to assist in interpreting the discovered code words. Finally, the code words are used as a new text feature for discovering harmful applications. The specific research work of this paper is as follows: (1) Taking the characteristics of the corpus data into account, existing new-word discovery algorithms are studied and a suitable one is selected to process the experimental data, which improves the accuracy of corpus segmentation, improves the quality of word-vector training, and increases the number of code words discovered. (2) Based on an improved word2vec model, a word sense disambiguation method for Chinese code words in mobile applications is proposed. According to the characteristics of the data, an optimization method for the word-vector richness error is proposed; a code-word discovery system is designed and implemented to discover Chinese code words; and a semantic auxiliary interpretation module based on a GloVe model is designed to assist in interpreting code words. (3) Using the discovered code words as a new text classification feature and choosing a suitable classification algorithm, a harmful-application classification system is designed and implemented. It predicts from an application's text whether the application is harmful and achieves high accuracy. |
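
The core idea behind code-word discovery, that a known dictionary word is used in application text with a sense different from its ordinary one, can be illustrated with a minimal sketch. This is not the improved word2vec model proposed in this paper: it substitutes raw co-occurrence counts for learned word vectors and simply compares a word's context in a general corpus with its context in an application-review corpus. The function names, the toy English corpora, the window size, and the threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def context_vector(sentences, target, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Add every neighbour in the window except the target itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def code_word_candidates(general, reviews, vocabulary, threshold=0.3):
    """Return words whose context in `reviews` differs from that in `general`."""
    flagged = []
    for word in vocabulary:
        sim = cosine(context_vector(general, word), context_vector(reviews, word))
        if sim < threshold:  # hypothetical cut-off, not taken from the paper
            flagged.append((word, round(sim, 3)))
    return sorted(flagged, key=lambda pair: pair[1])


# Hypothetical, already-segmented toy corpora; real data would be Chinese text
# segmented after new-word discovery.
GENERAL = [
    ["fresh", "apple", "fruit", "is", "sweet"],
    ["the", "weather", "is", "nice", "today"],
    ["she", "ate", "an", "apple", "fruit", "today"],
]
REVIEWS = [
    ["recharge", "apple", "to", "win", "cash"],
    ["apple", "recharge", "cash", "payout", "fast"],
    ["the", "weather", "is", "nice", "today"],
]

if __name__ == "__main__":
    print(code_word_candidates(GENERAL, REVIEWS, ["apple", "weather"]))
```

In this toy setting "apple" shares no context words across the two corpora and is flagged as a candidate, while "weather" is used identically in both and is not. A learned embedding model would replace the count vectors in practice, and the flagged words could then serve as features for a downstream classifier.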