
Augmentation Of Pre-Trained Model For Programming Language Based On Structure Information

Posted on: 2024-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: X X Jiang
Full Text: PDF
GTID: 2568307067493034
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development and popularization of Internet technology have led to the emergence of code intelligence, and code representation learning, as an important foundation of code intelligence, has received widespread attention from both academia and industry. Because code, like text, is represented as a sequence of characters, many studies have applied text representation learning techniques, particularly pre-training techniques, to code representation learning, achieving impressive results on various code-related tasks. Compared to text, however, code contains various kinds of structure information that can provide rich semantic knowledge to improve model performance. Existing methods for enhancing pre-trained code models with structure information suffer from two issues: (1) explicitly introducing structure information in the pre-training stage creates a mismatch between pre-training and fine-tuning inputs, limiting model performance; (2) methods applied at the fine-tuning stage do not fully model the relationship between multiple kinds of structure information and the target task, and also ignore the relationships among the different kinds of structure information themselves. To address these issues, this thesis follows the pre-training paradigm and conducts research on both the pre-training and fine-tuning stages. The main contributions are summarized as follows:

The first contribution introduces code structure information during the pre-training stage. Existing methods feed structure information to the model as input during pre-training, so that code sequences and structure information interact directly. However, because of parsing cost, the input typically includes only code during fine-tuning. To address this mismatch, this thesis proposes the Syntax-Guided Pre-Trained Model for Programming Language (SGBERT). Specifically, to keep pre-training and fine-tuning inputs consistent, SGBERT uses structure information as labels that guide training rather than as model input, and introduces three new pre-training tasks based on the Abstract Syntax Tree and code identifiers: Span Boundary Prediction, Node Attribute Prediction, and Variable Prediction. With the same parameter count and pre-training corpus, SGBERT outperforms existing state-of-the-art methods on multiple code-related tasks, demonstrating the effectiveness of the proposed method.
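As an illustration of the structure-as-labels idea, the following minimal Python sketch derives per-identifier supervision targets from an Abstract Syntax Tree while the model input remains plain code. The function name and label scheme here are hypothetical, not the thesis's exact task design.

```python
import ast

def identifier_labels(source: str) -> list[tuple[str, str]]:
    """Pair each identifier with an AST-derived attribute: for variables,
    whether the occurrence reads (Load) or writes (Store) the name.
    Such pairs could supervise a Node Attribute Prediction-style task
    without the AST ever appearing in the model input."""
    labels = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):           # variable occurrence
            labels.append((node.id, type(node.ctx).__name__))
        elif isinstance(node, ast.FunctionDef):  # function definition
            labels.append((node.name, "FunctionDef"))
    return labels

print(identifier_labels("def add(a, b):\n    total = a + b\n    return total"))
# [('add', 'FunctionDef'), ('total', 'Store'), ('total', 'Load'),
#  ('a', 'Load'), ('b', 'Load')]
```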
The second contribution uses structure information to assist code-related tasks during the fine-tuning stage. Focusing on defect detection, this thesis proposes the Multi-View Pre-Trained Model for Code Defect Detection (MV-PTM). MV-PTM constructs a multi-view representation of code from three types of structure information: the Abstract Syntax Tree, the Data Flow Graph, and the Control Flow Graph, and performs defect detection on this multi-view representation, thereby better modeling the relationship between structure information and defect data than existing methods. In addition, exploiting the semantic consistency among the different structure views, MV-PTM uses contrastive learning to build a multi-view auxiliary task that further enhances the code representation (a minimal sketch of such an objective follows this abstract). Experimental results on public defect detection datasets show that MV-PTM significantly improves the accuracy of defect detection.

In summary, this thesis studies the augmentation of pre-trained models for programming language based on structure information: structure information is introduced to enhance code representation during both the pre-training and fine-tuning stages, and the effectiveness of the proposed approach is validated on multiple code-related tasks.
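The multi-view contrastive auxiliary task mentioned above can be pictured with the following sketch, which assumes an InfoNCE-style objective over paired embeddings of the same code snippet under two structure views (e.g., AST and CFG). The function and its temperature default are illustrative assumptions, not MV-PTM's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_view_contrastive_loss(view_a: torch.Tensor,
                                view_b: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """view_a, view_b: (batch, dim) embeddings of the same snippets under
    two structure views; row i of each tensor forms a positive pair, and
    all other rows in the batch serve as negatives."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(a.size(0))         # positives lie on the diagonal
    # Symmetric InfoNCE: match each view-a row to its view-b row and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in embeddings for a batch of 8 snippets:
ast_emb, cfg_emb = torch.randn(8, 256), torch.randn(8, 256)
print(multi_view_contrastive_loss(ast_emb, cfg_emb))
```

Minimizing such a loss pulls the two views of the same snippet together while pushing apart views of different snippets, which is one standard way to exploit the semantic consistency between structure views.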
Keywords/Search Tags: Deep learning, Code representation learning, Pre-trained models, Vulnerability detection, Contrastive learning