Research On Robust Code Classification Based On Deep Learning

Posted on: 2024-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: N Y Zhao
GTID: 2568306941995649
Subject: Computer technology
Abstract/Summary:
The rapid development of the Internet has brought great convenience to people. The emergence of various open-source projects has reduced the cost of software development and improved efficiency. However, as the number of software projects increases, the quantity of malicious code has grown rapidly and plagiarism has become a more serious problem, posing a significant threat to Internet security. Code classification plays a crucial role in maintaining Internet security by enabling the identification of the authors of malicious code and the detection of potential code plagiarism. With the advances in deep learning, existing code classification solutions have achieved high accuracy. In real-world scenarios, however, imbalanced datasets and obfuscation attacks can severely degrade the performance of these methods, so it is of great significance to develop robust code classification approaches. This paper investigates two problems related to code classification: imbalanced datasets and obfuscation attacks.

(1) To address imbalanced datasets in code classification, this paper proposes a solution based on deep representation oversampling. To improve the quality of the features extracted from the code, the solution uses a deep representation feature extraction method based on LSTM networks. By leveraging the strength of LSTM networks in handling long sequences, it extracts, from sparse and discrete TF-IDF features of the code, deep representation features that better capture the characteristics of code authors. To tackle the imbalance itself, the solution introduces a feature model-based oversampling method, which learns the relationships between the values of the deep representation features and synthesizes minority-class samples to balance the dataset. Finally, a classification model is trained on the balanced dataset. The experimental results indicate that this approach is more robust than existing methods: its classification accuracy decreases less as the dataset imbalance increases.

(2) To address obfuscation attacks in code classification, this paper proposes an obfuscation-resistant code classification solution based on multimodal deep learning. It extracts features from four views of the code: binary files, assembly files, code text, and code syntax structure. The resulting features are binary image features, assembly instruction image features, TF-IDF features, and abstract syntax tree features. Binary image features are robust against data obfuscation and control flow obfuscation, assembly instruction image features are robust against data obfuscation, TF-IDF features are robust against layout obfuscation, and abstract syntax tree features are robust against data obfuscation. Because the extracted features differ in structure, the solution uses a multimodal deep learning model to fuse them. First, it constructs sub-models based on convolutional neural networks and LSTM networks for the different features and uses them to extract uniformly formatted feature vectors from the original features. Finally, the feature vectors are fed into a fusion layer for integration. The experimental results demonstrate that classification models trained with this approach are strongly resistant to obfuscation attacks.

The imbalanced code classification approach based on deep representation oversampling achieves a classification accuracy of 95.92% on the most imbalanced dataset extracted from the GCJ dataset, compared with 85.84% for the best existing solution. The obfuscation-resistant code classification approach based on multimodal deep learning attains a classification accuracy of 94.44% on the obfuscated GCJ dataset, compared with 90.37% for the existing solution. Both code classification approaches proposed in this paper therefore demonstrate strong robustness in their respective scenarios.
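
As a rough illustration of the oversampling step in (1), the sketch below balances a toy dataset by synthesizing minority-class feature vectors. It is not the feature model-based method described in the abstract, whose details are not given here; it substitutes a simple SMOTE-style interpolation between existing samples of the same class. The function name `oversample_minority`, the two-dimensional feature vectors, and the author labels are all hypothetical.

```python
import random
from collections import Counter


def oversample_minority(features, labels, seed=0):
    """Balance classes by interpolating between same-class samples.

    Assumption: a SMOTE-style stand-in, not the thesis's feature-model method.
    """
    rng = random.Random(seed)

    # Group the original feature vectors by class label.
    by_class = {}
    for vec, label in zip(features, labels):
        by_class.setdefault(label, []).append(vec)

    # Every class is grown to the size of the largest class.
    target = max(len(vecs) for vecs in by_class.values())
    out_features, out_labels = list(features), list(labels)

    for label, vecs in by_class.items():
        for _ in range(target - len(vecs)):
            # Pick two original samples and take a random point between them.
            a, b = rng.choice(vecs), rng.choice(vecs)
            t = rng.random()
            synthetic = [x + t * (y - x) for x, y in zip(a, b)]
            out_features.append(synthetic)
            out_labels.append(label)
    return out_features, out_labels


# Toy data: author "A" has four samples, author "B" only two.
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.9, 1.0], [1.0, 0.8]]
labels = ["A", "A", "A", "A", "B", "B"]
balanced_features, balanced_labels = oversample_minority(features, labels)
class_counts = Counter(balanced_labels)
```

After the call, both authors have the same number of samples, and each synthetic vector lies between two original vectors of its class. In the setting the abstract describes, the inputs would be the LSTM-derived deep representation features rather than hand-written vectors.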
Keywords: Code classification, Imbalance, Obfuscation attack, LSTM, Multimodal