With the rapid growth of software scale and the increasing frequency of software updates, developers have faced mounting development and maintenance pressure in recent years. During software development and maintenance, quickly understanding the relevant source code is a prerequisite for modifying it. Compared with natural language, programs are large in volume, strongly structured, and abstract, which makes them challenging to understand. Automatic code summarization summarizes the functionality of code and outputs concise, accurate natural language summaries, helping developers understand source code correctly and improving development and maintenance efficiency.

Current automatic code summarization techniques fall into three main categories: template-based, information retrieval-based, and deep learning-based. Template-based techniques rely on predefined rules or templates, which are time-consuming to build, and they struggle to produce effective summaries when function names and custom identifiers are poorly named. Information retrieval-based techniques depend on the developer's prior knowledge and take little account of the program's structural information during summary generation. In contrast, deep learning-based techniques automatically learn program features from existing data and are scalable. However, current deep learning-based techniques have the following shortcomings: (1) API information in code is important for the semantic representation of a program, and the call dependencies between APIs are most naturally expressed as graphs, yet existing techniques rarely consider this information or model it as a graph. (2) Code is a mixed structure of syntax and semantics, but current techniques ignore the correlation between code semantics and syntactic structure when modelling single-modal or multi-modal program features. (3) Because program scopes lengthen dependency intervals, most existing techniques, which rely on the serialized decoding of Seq2Seq models, are limited in capturing long dependencies in programs.

To address these issues, this paper conducts a series of studies covering three main aspects. (1) To address the problem of modelling code API information as a graph, this paper proposes a code semantic graph modelling method, the local API call dependency graph (Local-ADG), to improve the ability to characterize code semantics. Local-ADG extracts the API information in a single code snippet and constructs a graph representation based on the call dependencies between APIs and on their input and output parameters. Local-ADG extends the program semantic representation so that it effectively expresses code semantic knowledge and improves the capability of program semantic representation. (2) To address the problem that the correlation between code semantics and syntactic structure is not considered when extracting program feature knowledge, this paper proposes two multi-modal program structure feature fusion methods, one based on a similarity network and one based on an attention mechanism. These methods effectively extract the correlation knowledge between program semantics and syntax and express the overall program features more comprehensively, which improves the robustness of the summary information. (3) To address the problem of long dependencies caused by program scopes, this paper proposes a Transformer-based program learning model that relies entirely on the attention mechanism to model the global dependencies between inputs and outputs. The model effectively captures long dependencies, is more interpretable, reduces the memory burden of the network, and improves the accuracy of natural language summarization.
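
The idea behind Local-ADG, as summarized above, can be illustrated with a small sketch. The abstract does not specify the target programming language, the node and edge definitions, or the extraction tooling, so everything below is an assumption made for illustration: it works on Python snippets using the standard `ast` module, treats each called API as a node, and adds an edge when a value returned by one call is passed as an argument to another. The names `call_name`, `LocalADGBuilder`, and `build_local_adg` are hypothetical and do not come from the paper.

```python
import ast


def call_name(call: ast.Call) -> str:
    """Return a dotted name for the call target, e.g. 'json.loads'."""
    parts = []
    func = call.func
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    return ".".join(reversed(parts)) or "<unknown>"


class LocalADGBuilder(ast.NodeVisitor):
    """Illustrative only: collects API calls and output-to-input edges."""

    def __init__(self):
        self.nodes = set()    # API names called in the snippet
        self.edges = set()    # (producer_api, consumer_api) pairs
        self.producer = {}    # variable name -> API whose result it holds

    def visit_Assign(self, node):
        # Visit the right-hand side first so its calls are registered.
        self.visit(node.value)
        for target in node.targets:
            if isinstance(target, ast.Name):
                if isinstance(node.value, ast.Call):
                    self.producer[target.id] = call_name(node.value)
                else:
                    # Variable was overwritten by a non-call value.
                    self.producer.pop(target.id, None)

    def visit_Call(self, node):
        name = call_name(node)
        self.nodes.add(name)
        inputs = list(node.args) + [kw.value for kw in node.keywords]
        for arg in inputs:
            if isinstance(arg, ast.Call):
                # Nested call: its output feeds this call directly.
                self.edges.add((call_name(arg), name))
            elif isinstance(arg, ast.Name) and arg.id in self.producer:
                # Variable that holds the output of an earlier call.
                self.edges.add((self.producer[arg.id], name))
        self.generic_visit(node)  # descend into nested calls


def build_local_adg(source: str):
    """Return (sorted nodes, sorted edges) for a single code snippet."""
    builder = LocalADGBuilder()
    builder.visit(ast.parse(source))
    return sorted(builder.nodes), sorted(builder.edges)


# The snippet is only parsed, never executed, so undefined names are fine.
snippet = """
import json, os
path = os.path.join(base, "cfg.json")
text = read_text(path)
cfg = json.loads(text)
print(len(cfg))
"""

nodes, edges = build_local_adg(snippet)
for src, dst in edges:
    print(f"{src} -> {dst}")
```

This sketch ignores control flow, method receivers, and tuple assignments. A graph of this kind would typically be passed to a graph encoder, but how the paper encodes Local-ADG is not stated in the abstract.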