Research On Multi-scale Multi-modal Source Code Summarization Technology Based On Program Feature Enhancement

Posted on: 2024-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Gao
Full Text: PDF
GTID: 2568307058982019
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
Statistically, more than half of the time spent on software development and maintenance is devoted to program comprehension and related tasks. In most of these tasks, developers rely on comments to understand what the code means. However, comment writing is often neglected during development, so the comments available to developers are frequently of poor quality. Code summarization technology, which generates short natural-language descriptions of source code, not only frees developers from writing comments by hand but also improves development efficiency and reduces development costs.

Existing code summarization methods can be divided into traditional and deep learning-based approaches. Traditional methods are based on manually created templates or on information retrieval. While these approaches have achieved some results, they have two limitations: first, they rely too heavily on naming conventions and do not work when no similar piece of code exists in the codebase; second, their results depend on whether similar code snippets can be retrieved and on how similar those snippets are.

With the development of deep learning, a growing number of researchers use neural networks to generate summaries for code, exploiting the learning ability of these models to treat code summarization as a Neural Machine Translation (NMT) task. However, current deep learning-based work still has the following shortcomings: (1) although source code is highly structured, the code structure extraction techniques used in current methods acquire features incompletely and cannot guarantee the integrity and accuracy of the extracted structure; (2) although source code carries rich semantic information, most existing techniques consider only the lexical information in the source code token sequence and ignore semantic information such as data flow and control flow; (3) although source code has features in multiple modalities, the multi-modal fusion methods used by current techniques assign attention only within each modality and ignore how the relative importance of features across modalities affects the generated summary, so they cannot capture the correlation between modalities.

This thesis studies these problems; the work covers the following three aspects.

(1) For the problem of incomplete extraction of source code structure features, this thesis proposes a multi-scale feature extraction method based on the Abstract Syntax Tree (AST) to improve the completeness and accuracy of the extracted code structure information. Here, multi-scale refers to the set of power matrices obtained by taking the dot product of the AST's adjacency matrix with itself. The multi-scale representation of the AST makes it possible to extract source code structure features at multiple local and global levels, which captures code structure information effectively and improves the model's ability to learn features.

(2) For the problem that semantic information such as data flow and control flow is not considered in code feature extraction, this thesis proposes an AST representation with enhanced semantic features. The AST parsed from the source code is enriched with semantic information by adding several kinds of edges that carry data flow and control flow. This kind of AST is called the Enhanced-AST (E-AST for short). The resulting graph structure integrates the semantic and syntactic information of the source code, preserves the semantic information effectively, characterizes programming knowledge more comprehensively, and improves the quality of the generated summaries.

(3) For the problem that multi-modal fusion of source code cannot exploit feature correlations between modalities, this thesis proposes a method-name-guided cross-modal feature fusion approach. For the source code and AST modalities, this method can highlight the semantic and syntactic structure information in the AST based on the fused features and can also learn the contextual correlation between code tokens. The approach effectively extracts features from the different modalities, mines the correlation information of each modality in the source code, and improves the accuracy of the summaries.
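
The multi-scale idea in aspect (1), taking successive powers of the AST adjacency matrix, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the four-node toy AST, the node labels in the comments, the choice to treat edges as undirected, and the helper names `adjacency_matrix`, `matmul`, and `multi_scale_powers` are all assumptions made for illustration. The point it shows is that entry (i, j) of the k-th power counts the walks of length k between nodes i and j, so higher powers relate nodes that are farther apart in the tree.

```python
# Hypothetical toy AST (an assumption, not from the thesis):
# node index -> list of child indices.
# 0: MethodDeclaration, 1: Parameter, 2: Block, 3: ReturnStatement
TOY_AST_CHILDREN = {0: [1, 2], 1: [], 2: [3], 3: []}


def adjacency_matrix(children):
    """Build a symmetric 0/1 adjacency matrix from parent -> children links."""
    n = len(children)
    adj = [[0] * n for _ in range(n)]
    for parent, kids in children.items():
        for kid in kids:
            adj[parent][kid] = 1
            adj[kid][parent] = 1  # assumption: AST edges are treated as undirected
    return adj


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def multi_scale_powers(adj, max_scale):
    """Return [A^1, A^2, ..., A^max_scale] for the adjacency matrix A."""
    powers = [adj]
    for _ in range(max_scale - 1):
        powers.append(matmul(powers[-1], adj))
    return powers


adj = adjacency_matrix(TOY_AST_CHILDREN)
scales = multi_scale_powers(adj, max_scale=3)

# Parameter (1) and ReturnStatement (3) are three edges apart,
# so they are linked only at the third scale.
for k, power in enumerate(scales, start=1):
    print(f"scale {k}: walks between node 1 and node 3 = {power[1][3]}")
```

In this toy case the first power captures only parent-child links, while the third power already connects two leaves in different subtrees, which is one way to read the abstract's "local and global levels".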
Keywords/Search Tags:Source code summarization, Abstract syntax tree, Graph neural network, Multi-modal fusion, Transformer