Research On Code Search Based On Statistical Semantic Analysis

Posted on:2021-10-23

Degree:Master

Type:Thesis

Country:China

Candidate:C J Du

GTID:2518306476460164

Subject:Software engineering

Abstract/Summary:

Improving the efficiency of software development, ensuring software quality, and reducing development cost are three core issues in software engineering. With the development of the Internet and the popularity of open source software, a large amount of high-quality source code has become available online. Reusing this code efficiently is one of the effective ways to improve development efficiency. Code search has become a frequent activity in developers' daily work; in addition, code search often serves as a supporting technology for other software development techniques, such as code recommendation and code completion.

Traditional code search methods are mainly based on information retrieval and rely on keyword matching between the user's natural language query and code fragments. These methods lack an understanding of code semantics and are prone to term mismatch caused by the heterogeneity of code and natural language, which leads to low search accuracy and limited practicality. This thesis proposes a code search method based on statistical semantic analysis. It uses statistical methods to mine the deep statistical semantics of code from massive code resources, so as to better support matching between source code and natural language queries and improve the accuracy of code search. In particular:

(1) Semantic analysis of source code. Because the semantics of source code are difficult to extract, describe, and use, this thesis extracts the necessary code features at the method level and the class level to model the source code, and then constructs a code embedding network based on multi-feature modeling to extract deep statistical features of the source code.

(2) Matching natural language queries with source code. This thesis moves beyond the traditional keyword-based matching method. A code search model is built on joint embedding, mapping code and natural language queries into a unified vector space and matching them by cosine similarity.

(3) Tool support. To support an efficient code search process, a dedicated code search repository is built based on the above search method, and the corresponding code search tool SCSM (Semantic Code Search Model) is designed and implemented. It retrieves code snippets that are semantically related to the natural language query entered by the user, meeting the needs of code reuse.

To evaluate the code search method based on statistical semantic analysis, effectiveness evaluation experiments and influence factor analysis experiments are carried out on a data set of 1,263,974 <code, natural language description> pairs. The experimental results show that: (1) Compared with UNIF, the latest deep learning-based code search method, SCSM improves the Top-1 accuracy and MRR of the search by 20% and 0.15, respectively. (2) Compared with analyzing code as natural language text, multi-feature modeling of code effectively improves the accuracy of code search.
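
The matching step and the two reported metrics can be illustrated with a minimal Python sketch. It is not the SCSM implementation: it assumes the query and the code snippets have already been embedded into the same vector space, and the toy vectors, the snippet names, and the helper names (`cosine`, `rank_snippets`, `top1_accuracy`, `mean_reciprocal_rank`) are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_snippets(query_vec, code_vecs):
    """Return snippet ids ordered by descending cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), sid) for sid, vec in code_vecs.items()]
    # Sort by score (highest first); break ties by id for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sid for _, sid in scored]


def top1_accuracy(rankings, relevant):
    """Fraction of queries whose first-ranked snippet is the relevant one."""
    hits = sum(1 for r, rel in zip(rankings, relevant) if r and r[0] == rel)
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, relevant):
    """Mean of 1/rank of the relevant snippet; a missing snippet contributes 0."""
    total = 0.0
    for r, rel in zip(rankings, relevant):
        if rel in r:
            total += 1.0 / (r.index(rel) + 1)
    return total / len(rankings)


# Hypothetical 3-dimensional embeddings standing in for a trained model's output.
code_vecs = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.1, 0.9, 0.1],
    "http_get": [0.0, 0.2, 0.9],
}
query_vecs = [[1.0, 0.0, 0.1], [0.1, 0.6, 0.7]]
relevant = ["read_file", "sort_list"]  # assumed ground truth for each query

rankings = [rank_snippets(q, code_vecs) for q in query_vecs]
print(rankings)
print(top1_accuracy(rankings, relevant), mean_reciprocal_rank(rankings, relevant))
```

In this toy setup the first query ranks its relevant snippet first and the second query ranks it second, so Top-1 accuracy is 0.5 and MRR is 0.75. A real system would replace the hand-written vectors with the output of trained code and query encoders.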

Keywords/Search Tags:

Code Search, Software Reuse, Code Embedding, Statistical Semantics

Related items

1	A Code Description Semantics Vector Based Java Code Search
2	Research On The Attack And Defense Techniques Of Code Reuse
3	Research And Implementation Of Automatic Code Summarization And Retrieval Technology For Open Source Reuse
4	IO Example-based Code Search With Functional Semantics
5	A Research On Code-reuse Attacks And Detection Techniques
6	Research On Key Techniques Of Software Binary Code Reuse
7	Research On Code Search Technology Based On Features Of Code And Comment
8	Research On Code Reuse Attack Protection Technique Based On Virtual Machine Monitor
9	Research And Implementation Of Semantics-based Approach For Binary Code De-obfuscation
10	A qualitative study on the performance of R-code statistical software