In most computer security schemes, identity authentication is the first line of defense for network information security. Compared with authentication methods such as fingerprint and iris recognition, text passwords remain widely used because they are simple to implement, inexpensive, and easy to extend. However, with the rapid growth of the Internet, the security risks of text passwords have become more apparent. On the one hand, as the number of passwords users must remember grows, users tend to create new passwords by applying simple transformations to previously used ones. On the other hand, frequent leaks of website password databases have enabled more diverse attacks, which seriously threaten users’ private data and property. Studying password security is therefore important for ensuring the security of Internet information.

Password guessing is a significant area of password security research. A good password guessing model can better measure the strength of a single password or of a password dataset, and thereby provide effective suggestions to website administrators and users. At present, the mainstream methods are the Markov model and the probabilistic context-free grammar (PCFG) model, both based on probability statistics. They rely heavily on how finely password structures are divided and on the accuracy of the estimated probabilities, and they cannot capture much of the deep semantic information contained in passwords. On this basis, researchers have performed semantic mining (word segmentation and part-of-speech tagging) on the letter segments of passwords and made further improvements using natural language processing techniques. The shortcoming of these approaches is that they overlook the numeric segments of a password, which also carry rich semantic information. For example, a numeric segment that begins with “314…” is far more likely to continue with “159” than with “666” or “123”. How to use the
digit semantics in passwords to improve the guessing efficiency of the model has thus become a new problem. Many researchers have also introduced machine learning and deep learning into password guessing models and demonstrated their feasibility experimentally.

To address the above problems, this paper applies deep learning to large-scale real password datasets and focuses on three aspects of password security: vulnerability analysis of large-scale real passwords, extraction of digit semantic features from passwords, and a password guessing model based on a variational autoencoder (VAE) and bidirectional LSTM (Bi-LSTM). The main research contents of this paper are as follows:

1. Analysis of the vulnerability characteristics of large-scale real passwords. Current research neglects the influence of users’ language on password creation when analyzing password datasets, and lacks an analysis of how user passwords differ across languages. Therefore, this paper analyzes more than 6 million passwords from six datasets of Chinese, English, and German websites and groups the datasets by the native language of each website’s users. It then studies the similarities and differences in the password vulnerability characteristics of Chinese, English, and German users, including popular passwords, password structures, password language dependency, and password length distribution. The results confirm that users’ language and culture have a significant impact on password creation. For example, Chinese users mostly use purely numeric passwords, English users mostly combine letters and numbers, and German users strongly prefer words as passwords. Experiments also show that letter-frequency analysis can help infer a user’s language.

2. A semantic extraction method for password numerals based on SentencePiece is proposed. Current research lacks the division of semantic
categories in password numeric segments. Therefore, drawing on the subword model idea from natural language processing (NLP), this paper uses the SentencePiece segmentation model to segment and classify password numeric segments and to induce their semantic categories. By analyzing the semantic characteristics of numbers, six types of digit semantic tags are summarized: ordinary number sequences, meaningful number sequences, communication number sequences, date number sequences, keypad number sequences, and multi-tag number sequences. A SentencePiece-based method for extracting digit semantics is then proposed. Experimental analysis reveals the important influence of digit semantics on password creation and verifies that most numeric segments in passwords follow certain arrangement rules.

3. A password guessing model based on VAE and Bi-LSTM is proposed. Current research on password guessing models often fails to apply digit semantics effectively. Therefore, taking advantage of the unstructured nature of passwords, this paper uses the variational autoencoder from deep learning to train on and guess text password sequences, with a bidirectional LSTM as the core module, and proposes the PassVAE model. PassVAE-D is then obtained by improving the model with the digit semantics in passwords. The experimental results show that every model achieves higher matching accuracy in one-site tests than in cross-site tests. Moreover, compared with the classical probability models, the improved model PassVAE-D has an advantage at large numbers of guesses, and it achieves higher matching accuracy than PassGAN in both one-site and cross-site tests.
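
The six digit semantic tags listed in the second contribution can be made concrete with a small rule-based sketch. This is not the SentencePiece-based method of the paper; it is a hand-written illustration, and every rule in it is an assumption: the lexicon `MEANINGFUL`, the date format (YYYYMMDD, years 1900 to 2099), the communication rule (11 digits starting with "1"), the keypad rule (a run along the digit row), and the reading of "multi-tag" as "matches more than one rule". All function names are hypothetical.

```python
from __future__ import annotations

import re
from datetime import datetime

# Hypothetical lexicon of "meaningful" digit strings; illustrative only.
MEANINGFUL = {"314159", "520", "1314"}

# Assumption: keypad sequences are runs along the digit row, in either direction.
KEYPAD_ROW = "1234567890"


def digit_segments(password: str) -> list[str]:
    """Return the maximal runs of digits in a password."""
    return re.findall(r"\d+", password)


def is_date(seg: str) -> bool:
    """Assumed rule: 8 digits forming a valid YYYYMMDD date in 1900-2099."""
    if len(seg) != 8:
        return False
    year, month, day = int(seg[:4]), int(seg[4:6]), int(seg[6:])
    if not 1900 <= year <= 2099:
        return False
    try:
        datetime(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


def is_keypad(seg: str) -> bool:
    """Assumed rule: at least 3 digits forming a run on the digit row."""
    return len(seg) >= 3 and (seg in KEYPAD_ROW or seg in KEYPAD_ROW[::-1])


def is_communication(seg: str) -> bool:
    """Assumed rule: 11 digits starting with '1' (mobile-number shape)."""
    return len(seg) == 11 and seg.startswith("1")


def tag_segment(seg: str, lexicon: set[str] = MEANINGFUL) -> str:
    """Assign one of the six tags to a digit segment using the rules above."""
    tags = []
    if seg in lexicon:
        tags.append("meaningful")
    if is_communication(seg):
        tags.append("communication")
    if is_date(seg):
        tags.append("date")
    if is_keypad(seg):
        tags.append("keypad")
    if not tags:
        return "ordinary"
    # Assumption: a segment matching several rules is labelled "multi-tag".
    return tags[0] if len(tags) == 1 else "multi-tag"


def tag_password(password: str) -> list[tuple[str, str]]:
    """Tag every digit segment found in a password."""
    return [(seg, tag_segment(seg)) for seg in digit_segments(password)]
```

For example, `tag_password("wang19900101")` extracts the single segment `19900101` and tags it as a date under these rules. A learned subword model would replace the hand-written rules with segment boundaries induced from the data.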