Program Semantics Understanding Via Machine Learning

Posted on: 2020-09-01  Degree: Master  Type: Thesis
Country: China  Candidate: L F Qin  Full Text: PDF
GTID: 2428330602451051  Subject: Computer Science and Technology
Abstract/Summary:
Program semantics understanding is essential to many tasks, such as program analysis, bug detection, and malware detection. Done manually, however, it is tedious and time-consuming, so automatic approaches to program understanding are needed. Machine learning can learn relevant information from large amounts of source code and has become a research hotspot. If prior experience and knowledge about source code are incorporated into the design of a machine learning model, the model can understand the semantics of source code better. Compared with the plain text of source code, control flow automata and abstract syntax trees represent the structural features of a programming language more intuitively. In this thesis, we use control flow automata, abstract syntax trees, and program graphs as representations of source code, and design machine learning models that learn different aspects of the source code and produce multidimensional vector representations of it. The main work of this thesis can be summarized as follows:

(1) Using control flow automata as the representation of source code, a machine learning model based on the Weisfeiler-Lehman graph kernel and the Doc2Vec model is designed in order to learn the control flow information of the source code. After training, the model has learned a multidimensional vector representation of the source code. To check whether the model has learned the control flow information, a classifier built on the learned vectors is trained to perform source code classification and is compared with other existing machine learning models. The results show that, because of the limited information that control flow automata can express, the model learns only part of the necessary information and fails to reach our goal.

(2) Because the abstract syntax tree contains more detailed information about source code, we use it as the representation of source code and design a machine learning model to learn the grammar information of the source code. After training, the model has learned a multidimensional vector representation of the source code. In the source code classification task, this model achieves an accuracy of 97.93%, an improvement over the previous model and higher than the existing models, showing that the model can effectively learn the syntactic information of the source code. In addition, the clustering results show that the model can both identify similar tokens and distinguish tokens that differ greatly.

(3) To further improve the learning ability of the model, a program graph is designed that combines the control flow automata and the abstract syntax tree, and a representation learning model is built on this graph to learn both the syntax and the control flow information of the source code. After training, the model has learned a multidimensional vector representation of the source code. In the source code classification task, this model achieves an accuracy of 98.94%, which is higher than that of the previous model. In the similar source code query task, this model also achieves the highest accuracy. Finally, the visualization experiment shows that the source code representation vectors learned by this model have good separability. Together, these results show that the program graph is more useful for the representation learning model in extracting source code information.
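As an illustration of the idea in item (1), the following is a minimal sketch of Weisfeiler-Lehman relabeling applied to a control flow graph. It is not the thesis's implementation: the function name `wl_document`, the toy graphs, the node labels, and the choice to treat only successor nodes as neighbours are all assumptions made for this example. Each round replaces a node's label with a hash of its own label plus the sorted labels of its successors, and the labels collected across rounds form a "document" of tokens that a Doc2Vec-style model could embed.

```python
import hashlib


def wl_document(successors, labels, iterations=2):
    """Collect Weisfeiler-Lehman labels of one graph as a list of tokens.

    successors: dict mapping each node to a list of successor nodes
                (hypothetical encoding of a control flow automaton).
    labels:     dict mapping each node to an initial string label.
    """
    current = dict(labels)
    document = sorted(current.values())  # round-0 labels
    for _ in range(iterations):
        updated = {}
        for node, succ in successors.items():
            # Sorting makes the signature independent of neighbour order.
            neighbour_labels = sorted(current[n] for n in succ)
            signature = current[node] + "|" + ",".join(neighbour_labels)
            digest = hashlib.sha1(signature.encode("utf-8")).hexdigest()
            updated[node] = digest[:8]  # compressed new label
        current = updated
        document.extend(sorted(current.values()))
    return document


# Toy control flow graph of a loop (labels are illustrative only).
loop_cfg = {0: [1], 1: [2, 3], 2: [1], 3: []}
loop_labels = {0: "entry", 1: "branch", 2: "assign", 3: "exit"}

# The same graph with renamed nodes.
renamed_cfg = {"a": ["b"], "b": ["c", "d"], "c": ["b"], "d": []}
renamed_labels = {"a": "entry", "b": "branch", "c": "assign", "d": "exit"}

# A structurally different, straight-line graph.
straight_cfg = {0: [1], 1: [2], 2: []}
straight_labels = {0: "entry", 1: "assign", 2: "exit"}

doc_loop = wl_document(loop_cfg, loop_labels)
doc_renamed = wl_document(renamed_cfg, renamed_labels)
doc_straight = wl_document(straight_cfg, straight_labels)

print(doc_loop == doc_renamed)   # identical structure, identical tokens
print(doc_loop == doc_straight)  # different structure, different tokens
```

Because the tokens depend only on labels and structure, two graphs that differ only in node naming yield the same document. A document of this kind could then be passed, one per program, to a Doc2Vec implementation to obtain a vector per program; that step is omitted here to keep the sketch free of third-party dependencies.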
Keywords/Search Tags: program semantics understanding, source code processing, representation learning, abstract syntax tree, control flow automata