The booming open source ecosystem has attracted a growing number of developers and accumulated a huge amount of open source code, and reusing that code has become an important way to improve the efficiency of software development. However, code reuse can also harm software quality, causing hazards such as vulnerability propagation, unauthorized plagiarism, and threats to software maintainability and consistency. More and more researchers are therefore working on automated code clone detection techniques in the hope of reducing the harm caused by code cloning in the open source ecosystem. Based on the massive amount of code in the open source community, this paper explores efficient code clone detection methods, focusing on the detection effectiveness and execution efficiency of clone detection techniques. The main work and contributions of the paper are summarized as follows.

First, this paper focuses on the scalability and time efficiency of code clone detection methods and provides an in-depth analysis and summary of existing work. By surveying current research on clone detection in terms of time efficiency and scalability, we analyze the concerns, solution ideas, and development trends of clone detection technology at this stage, and examine the two key factors that affect the efficiency of code clone detection, namely code representation and the detection matching process. We summarize the key technologies and the relationships among them, and collate the performance of various tools on different data sets, providing technical support and directional guidance for developers and researchers.

Second, an efficient code clone detection method based on unsupervised clustering, AKuC, is proposed. We extract semantic information from code with an auto-encoder, apply unsupervised clustering for coarse grouping, and then use the NSG (Navigating Spreading-out Graph) algorithm to further reduce the number of matches during detection, achieving fast clone detection for large-scale code. The efficiency and scalability of the method were verified experimentally on the BigCloneBench benchmark and the IJaDataset large Java project library. The results show that AKuC achieves 99.16% accuracy, that its recall on MT3 and WT3/T4 clones has a significant advantage over other unsupervised detection methods, and that it scales smoothly to detection over 250 MLOC of code.

Third, a distributed clone detection system for large-scale open source code, CCEyes, is designed and implemented. We collected a total of 3.2 BLOC of code from two open source communities, Trustie and GitHub (one domestic and one international), built a large-scale semantic representation corpus of open source code using deep learning, and used HBase to store the corpus efficiently. CCEyes can store large amounts of data at low cost and supports highly concurrent real-time queries, providing a code checking service platform for developers and researchers.
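
The coarse-grouping idea behind the second contribution can be illustrated with a minimal sketch. This is not the AKuC implementation: it assumes each code fragment has already been mapped to a fixed-length vector (standing in for the auto-encoder output), uses a plain k-means as the unsupervised clustering step, and replaces the NSG graph search with exhaustive comparison inside each cluster. The function names, the farthest-point initialization, and the distance threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import dist


def farthest_point_init(vectors, k):
    """Pick k starting centroids deterministically (farthest-point heuristic)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Choose the vector whose nearest chosen centroid is farthest away.
        nxt = max(vectors, key=lambda v: min(dist(v, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(vectors, k, iterations=10):
    """Plain k-means; returns one cluster label per input vector."""
    centroids = farthest_point_init(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(v, centroids[j]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def detect_clone_pairs(vectors, k, threshold):
    """Report pairs closer than `threshold`, comparing only within clusters.

    Returns (sorted list of index pairs, number of comparisons performed).
    """
    labels = kmeans(vectors, k)
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)

    pairs, comparisons = [], 0
    for members in groups.values():
        # Assumption: brute force here stands in for the NSG search step.
        for i, j in combinations(members, 2):
            comparisons += 1
            if dist(vectors[i], vectors[j]) <= threshold:
                pairs.append((i, j))
    return sorted(pairs), comparisons


# Toy embeddings (hypothetical): two near-duplicate pairs and one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (5.0, -5.0)]
clone_pairs, n_comparisons = detect_clone_pairs(embeddings, k=2, threshold=0.5)
print(clone_pairs, n_comparisons)
```

On these five toy vectors the sketch performs 4 pairwise comparisons instead of the 10 an all-pairs scan would need, which is the kind of saving coarse grouping is meant to provide. The trade-off is that a clone pair split across two clusters is never compared, so recall depends on the quality of the representation and the clustering.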