
Duplicate Detection Of Pull Request In GitHub Community

Posted on: 2021-01-07  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Zheng  Full Text: PDF
GTID: 2428330611951365  Subject: Software engineering
Abstract/Summary:
GitHub's mass-participation development model has had a profound impact on software development, but its highly streamlined project management model can also lead to chaotic project management. A Pull Request records a contributor's changes to the code; after submission, a maintainer must manually review it and decide whether to accept the change. For projects that attract a lot of attention, many of the submitted Pull Requests duplicate one another. Existing research shows that Pull Requests account for about 40% of the duplicate relationships on GitHub. Detecting duplicate Pull Requests manually is very difficult and delays the development and maintenance of the software.

This study proposes two methods for automatically detecting duplicate Pull Requests. First, for the text description of a Pull Request, it proposes duplicate detection by text similarity, based on both the explicit features of the text and an implicit topic model. The BM25 method is used to extract the explicit features of the text, and the LDA algorithm is added to extract topic features that capture the semantic information of the text. Second, for the code submitted with a Pull Request, it proposes code similarity detection for Java and Python. The abstract syntax tree is used to represent the syntactic structure and other information of the code, and its subtrees are converted into feature vectors. To reduce the time a linear search over these vectors would take, a locality-sensitive hashing algorithm is used for approximate nearest-neighbor search, and the similarity of the code is obtained by counting the similar code fragments in the submitted code. Finally, the study proposes combining the text similarity and code similarity models of a Pull Request to detect duplicates.

Using the proposed method, three projects (Rails, Angular.js and Elasticsearch) were tested for duplicate detection based on explicit text-feature similarity combined with implicit topic-model similarity. The method improves Recall-rate@20 by 1.36% over the existing method. For the Java- and Python-based projects, Scikit-learn and Elasticsearch were tested both for duplicate detection by code similarity alone and for duplicate detection combining text similarity and code similarity. With text similarity and code similarity combined, the average Recall-rate@20 is 63.63%, an increase of 4.49% over existing methods.
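
The text-similarity step and the Recall-rate@20 metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`tokenize`, `bm25_scores`, `recall_rate_at_k`), the whitespace tokenizer, the parameters `k1=1.5` and `b=0.75`, and the sample Pull Request titles are all assumptions made for illustration. The LDA topic features and the code-similarity model are omitted, and Recall-rate@k is taken here in the sense commonly used in duplicate-report detection: the fraction of queries whose known duplicate appears among the top k ranked candidates.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # average document length
    df = Counter()  # number of documents containing each term
    for d in tokenized:
        df.update(set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def recall_rate_at_k(ranked_lists, true_duplicates, k=20):
    """Fraction of queries whose known duplicate is in the top k candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_lists, true_duplicates)
        if truth in ranked[:k]
    )
    return hits / len(true_duplicates)


# Made-up Pull Request titles, for illustration only.
existing_prs = [
    "fix null pointer in query parser",
    "add dark mode to settings page",
    "update readme installation steps",
]
new_pr = "query parser null pointer fix"

scores = bm25_scores(new_pr, existing_prs)
# Rank existing Pull Requests from most to least similar to the new one.
ranked = sorted(range(len(existing_prs)), key=lambda i: scores[i], reverse=True)
print(ranked)
print(recall_rate_at_k([ranked], [0], k=1))
```

In this toy setup the new Pull Request shares terms only with the first existing one, so that one ranks first. A full pipeline along the lines of the abstract would blend the BM25 score with a topic-model similarity and a code similarity before ranking.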
Keywords/Search Tags:Pull Request, Duplicate Detection, Software Maintenance, Topic Model, Abstract Syntax Tree