Design And Implementation Of Code Clone Detection System

Posted on: 2022-06-10    Degree: Master    Type: Thesis
Country: China    Candidate: W Li    Full Text: PDF
GTID: 2518306605489974    Subject: Master of Engineering
Abstract/Summary:
With the development of Internet technology, commercial software of all kinds has entered people's lives. While legitimate software brings convenience, it also raises many software security problems. In today's Internet environment, malicious actors copy the code of legitimate software and crack it through reverse engineering, which seriously threatens enterprise software security and creates many software intellectual property problems. Software homology detection is an effective way to address such problems, and it can be performed on either source code or binary files. Because the source code of the commercial software under test is not available, this work analyzes binary files. In this thesis, we design and implement a code clone detection system based on research into and analysis of binary file matching technology. Its main purpose is to find similar fragments of binary code and to measure the similarity between known open source software and the software under test.

The main work is as follows. First, the MD5 algorithm is used to obtain a program signature of each binary file for file-level comparison. Second, IDA is used to disassemble executable files into assembly code; a control flow graph is constructed for each function from its basic blocks and the jumps between them; functions are compared by their extracted Simhash values; and the similarity between two files is finally computed from the proportion of similar functions in the file.

The system has two parts: a Python component that computes the MD5 values of binary files and the Simhash values of functions, and a web application built on the Spring Boot framework that performs the file-level and function-level comparisons. The file level quickly filters out identical files by comparing their program signatures; files that are not identical are compared at the function level using the Simhash values of their functions.

To improve the efficiency of function matching, the Hamming distance criterion of the traditional Simhash algorithm is modified: the 64-bit fingerprint is divided into two 32-bit blocks instead of four 16-bit blocks, and the threshold for treating two code blocks as similar is tightened from a Hamming distance of at most 3 to a Hamming distance of at most 1. By the pigeonhole principle, when the Hamming distance is at most 1, at least one of the two 32-bit blocks of a matching fingerprint in the database must be identical to the corresponding block of the function under test. An index is therefore built on each 32-bit block, and the inverted index structure is used to retrieve candidates from the database for comparison. By extracting function features with the Simhash algorithm and exploiting the inverted index structure of Elasticsearch, the system compares functions quickly across massive data.

Open source industrial software is selected as the sample set for similarity matching against the software under test. By comparing the MD5 values of binary files and the Simhash values of functions, the system effectively detects the similarity between the software under test and the open source software. The experimental results show that both the file-level and the function-level comparisons detect the similarity between each open source industrial software package and the unknown software under test, and that the system meets the basic needs of users.
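The two-level comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `file_md5`, `simhash64`, `BlockIndex`, and `file_similarity` are hypothetical, the function features are assumed to be plain instruction strings, MD5 is assumed as the per-feature hash inside Simhash, and an in-memory dictionary stands in for the Elasticsearch inverted index. The sketch indexes each 32-bit half of a 64-bit fingerprint, so any fingerprint within Hamming distance 1 of a query shares at least one half with it and is found by exact lookup.

```python
import hashlib
from collections import defaultdict


def file_md5(data: bytes) -> str:
    """File-level signature: identical files yield identical MD5 digests."""
    return hashlib.md5(data).hexdigest()


def simhash64(features):
    """64-bit Simhash over string features (assumed: one string per instruction)."""
    weights = [0] * 64
    for feature in features:
        # Assumption: the first 8 bytes of MD5 serve as the per-feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def halves(fingerprint: int):
    """Index keys: the high and low 32-bit blocks, tagged by position."""
    return (("hi", fingerprint >> 32), ("lo", fingerprint & 0xFFFFFFFF))


class BlockIndex:
    """In-memory stand-in for an inverted index keyed on 32-bit blocks."""

    def __init__(self):
        self._buckets = defaultdict(set)

    def add(self, name: str, fingerprint: int) -> None:
        for key in halves(fingerprint):
            self._buckets[key].add((name, fingerprint))

    def query(self, fingerprint: int, max_distance: int = 1):
        # Pigeonhole: at most 1 differing bit leaves one half untouched,
        # so every true match appears in one of the two buckets.
        candidates = set()
        for key in halves(fingerprint):
            candidates |= self._buckets.get(key, set())
        return sorted(
            name for name, other in candidates
            if hamming(fingerprint, other) <= max_distance
        )


def file_similarity(index: BlockIndex, fingerprints) -> float:
    """Proportion of a file's functions that have a similar indexed function."""
    if not fingerprints:
        return 0.0
    matched = sum(1 for fp in fingerprints if index.query(fp))
    return matched / len(fingerprints)
```

In this sketch the exact-match lookup on a 32-bit block only narrows the candidate set; the final Hamming check is still needed, because two fingerprints can share one half and differ by several bits in the other.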
Keywords/Search Tags: binary file, MD5 algorithm, function feature, Simhash algorithm, similarity calculation