Research On Author Attribution Discrimination For Source Code

Posted on: 2022-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2518306542481174
Subject: Software engineering
Abstract/Summary:
Code author attribution is the process of identifying the author of a given piece of code. With the emergence of more malware and advanced mutation techniques, a large number of malware variants have been created, and identifying the authors of malicious code has become an active research topic. Author-style features in malware can be used to predict the types of tools and techniques used by the authors of specific malware, as well as the ways in which the malware spreads and evolves. Code author attribution can be used to identify and classify malware authors, so selecting more distinctive code style features and more efficient deep learning methods is of great significance for identifying code authors. At present, machine learning based on PDG features for source code author attribution has gradually become a research hotspot. In existing feature collection, although the data and dependencies in the program are captured, the coupling degree of the whole program is not analyzed, and data sets remain imbalanced when some classes have only a small number of samples; both can substantially reduce the accuracy of code author attribution in specific cases. Therefore, it is necessary to study code author attribution that fuses different types of features and balances the data sets. In addition, because the collected malicious code is unevenly distributed across types, it is particularly important to extract features from the classes with few samples.

To address the lack of a whole-program measurement at the code level, the research started from feature selection. First, principal component analysis was used to extract and quantify the coupling degree of an author's source code. Second, feature vectors of the Program Dependence Graph (PDG), which carry control flow and data flow characteristics, were extracted; the PDG features after weight analysis were then fused with the coupling degree to form feature vectors that express style more distinctly. Finally, the proposed Coupling Program Neural Network (CPNN) model was used for training and testing. The experimental results show that the fused features better reflect the programmer's style, and that the improved code word vector network model outperformed the other deep learning models. An accuracy of 97% was achieved on source code datasets of different types covering 1000 authors.

To solve the problem of imbalanced data among C++, Java and C#, a Synthetic Minority Over-sampling Technique Recurrent Neural Network (SMOTE-RNN) was proposed to discriminate imbalanced source code data sets. First, N-gram features weighted by Term Frequency-Inverse Document Frequency (TF-IDF) were extracted. Second, based on similarity to the majority and minority class samples, new minority class samples were synthesized to balance the three types of features. Finally, the input samples were fine-tuned and optimized with the recurrent neural network to obtain the prediction results. Training accuracy after SMOTE balancing was much better than without data balancing, and the best results were reached quickly. With this method, an accuracy of 90% was achieved on the imbalanced dataset of 1000 programmers.
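
The first two steps of the SMOTE-RNN pipeline described above (TF-IDF weighted N-gram features, then synthesis of new minority class samples) can be illustrated with a minimal sketch. The abstract does not give the thesis implementation, so everything here is an assumption made for illustration: the function names `char_ngrams`, `tfidf_vectors` and `smote` are hypothetical, character trigrams stand in for whatever N-gram unit the thesis uses, the TF-IDF formula is the plain unsmoothed variant, and the oversampling is the basic SMOTE interpolation rather than the similarity-based variant the thesis proposes. The recurrent neural network stage is not shown.

```python
import math
import random
from collections import Counter


def char_ngrams(code, n=3):
    """Return the overlapping character n-grams of a source string."""
    return [code[i:i + n] for i in range(len(code) - n + 1)]


def tfidf_vectors(documents, n=3):
    """Build TF-IDF vectors over character n-grams (plain, unsmoothed)."""
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    vocab = sorted(set().union(*counts))
    n_docs = len(documents)
    # Document frequency: how many documents contain each n-gram.
    df = {g: sum(1 for c in counts if g in c) for g in vocab}
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append(
            [(c[g] / total) * math.log(n_docs / df[g]) for g in vocab]
        )
    return vocab, vectors


def smote(minority, n_new, k=2, rng=None):
    """Synthesize n_new points by interpolating between a minority sample
    and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            [b + gap * (q - b) for b, q in zip(base, neighbour)]
        )
    return synthetic


# Toy, made-up snippets: four "java" samples and two "csharp" samples.
samples = [
    "for (int i = 0; i < n; i++) { sum += i; }",
    "for (int j = 0; j < m; j++) { total += j; }",
    "while (k > 0) { k--; System.out.println(k); }",
    "if (x == null) { return; } else { x.run(); }",
    "foreach (var item in items) { Console.WriteLine(item); }",
    "foreach (var row in rows) { Console.Write(row); }",
]
labels = ["java", "java", "java", "java", "csharp", "csharp"]

vocab, X = tfidf_vectors(samples)
minority = [x for x, y in zip(X, labels) if y == "csharp"]
# Two synthetic samples bring the minority class up to the majority count.
synthetic = smote(minority, n_new=2)
print(len(vocab), len(minority) + len(synthetic))
```

Because each synthetic vector lies on the line segment between two real minority samples, the balanced set stays inside the region the minority class already occupies; in practice a maintained library implementation would normally be preferred over this hand-written version.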
Keywords/Search Tags: Code Authorship Attribution, Coupling Degree, SMOTE, Program Dependence Graph, N-gram