
Research On Code Summary And Tag Automatic Generation Technology Based On Massive Open Source Resources

Posted on: 2019-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: L B Zeng
Full Text: PDF
GTID: 2428330611493644
Subject: Software engineering
Abstract/Summary:
In recent years, the open source ecosystem has developed rapidly, and high-quality open source projects have emerged one after another, producing a huge volume of open source resources. Given this volume, providing rich descriptive information for open source code and locating code fragments quickly and efficiently are key to reusing high-quality open source resources, improving developer efficiency, and sustaining the healthy development of the open source community. However, open source projects are numerous, grow rapidly, and iterate frequently, and their code is often complex in structure, poorly organized, and lacking in specification documents. Enriching the association between code fragments and natural language, and thereby promoting the reuse and dissemination of open source resources, is therefore a major current challenge.

In response to these challenges, this thesis focuses on the lack of labeling of open source code. It studies techniques for efficiently collecting massive open source code data and, building on the resulting large-scale data, studies deep learning based methods for code summarization and automatic tag generation, providing strong support for the reuse and dissemination of open source resources. The specific contributions are as follows:

1. Open source data collection technology for the GitHub community. The open source community contains a huge amount of open source resources, with large data volumes and frequent iterations. A rich open source database is a key resource for researchers working on open source resource reuse and software engineering. With this in mind, this thesis uses the API provided by the GitHub community, combined with the software code repository protocol, to design and implement an algorithm that continuously collects and updates GitHub community data in real time. On this basis, to mine the software development history and behavior data held in code repositories, we analyze repository historical data with natural language processing methods. The effectiveness of our approach was verified through the continuous and stable collection of 40,000 open source code repositories and thousands of open source project data.

2. Automatic code summarization technology based on deep learning. Open source projects have many contributors of uneven skill levels and often lack code comments, which makes it difficult for software developers to quickly understand code snippets and therefore to reuse high-quality open source code. We combine empirical software engineering knowledge with deep neural networks to propose a method-level automatic code summarization method for open source projects. From the top 5,000 Java projects on GitHub, we extracted more than 390,000 code snippets paired with natural language descriptions. The experimental results show that the accuracy of our summary generation is 32.5% higher than that of existing work, which verifies the validity of the model.

3. Code tag generation technology for open source resource retrieval. The rapid development of the open source ecosystem provides software developers with a huge number of code snippets, but the sheer volume also makes it harder for developers to locate code snippets with natural language queries. To make locating code fragments through natural language descriptions more efficient, and thus to improve the efficiency of code fragment reuse, we combine the traditional TF-IDF method with a recurrent neural network to propose a method for generating code tags for natural language query statements. From the top 5,000 Java open source projects on GitHub, we extracted 700,000 pairs of natural language descriptions and code fragments. The experimental results show that the method achieves high accuracy in code tag generation and improves code retrieval accuracy by 17.04%, which verifies the effectiveness of the method.
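As an illustration of the traditional TF-IDF component mentioned in the third contribution, the following is a minimal sketch that ranks candidate tags for each description by TF-IDF score. It is not the thesis's method: the recurrent neural network part is omitted entirely, and the function name `tfidf_tags`, the `top_k` parameter, the whitespace tokenization, and the sample descriptions are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_tags(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    tags = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency times inverse document frequency (natural log).
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:top_k])
    return tags


# Hypothetical natural language descriptions of code snippets.
docs = [
    "read file into string buffer".split(),
    "write string to file".split(),
    "sort list of string values".split(),
]
print(tfidf_tags(docs, top_k=2))
```

A term that appears in every description, such as "string" here, receives a score of zero and is ranked last, which is why TF-IDF tends to favor terms that distinguish one snippet from the others.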
Keywords/Search Tags:GitHub, Data collection, Code summary, Code tags, Open Source Software