Fraud Detection On GitHub Using Heterogeneous Graph Neural Network

Posted on: 2022-11-26  Degree: Master  Type: Thesis
Country: China  Candidate: J S Lin  Full Text: PDF
GTID: 2480306752954299  Subject: Master of Engineering
Abstract/Summary:
GitHub is the world's largest open source community. The numbers of Stars and Forks record how many times users have followed and reused a repository, and they largely reflect the repository's quality. However, paid promotion services that cheat on these metrics exist for GitHub developers: service providers use batches of bot accounts to deliver large numbers of paid Stars and Forks to particular repositories. These accounts mimic the behavior and activity frequency of normal accounts to evade GitHub's detection mechanisms. If this gray industry chain cannot be detected effectively, it will damage the healthy ecosystem of the entire open source community. To detect GitHub cheating promotion services, this thesis studies dataset construction and deep learning methods, and explores improvements to them.

No authoritative dataset exists for detecting GitHub cheating promotion services. This thesis builds a GitHub repository network dataset from raw data in the GitHub event log archive database. First, because the nodes and edges of the GitHub network are heterogeneous, the network is treated as a heterogeneous information network, and its characteristics and structure are studied with related methods. Next, taking repository nodes as the main node type, a 28-dimensional feature vector describing each repository is built from repository attributes and statistical features of its event log. The 28-dimensional feature of each repository is then combined with three meta-paths to define similarity vectors between projects, so that similarity is evaluated under multiple semantics. Finally, the threshold parameters used during dataset construction are treated as variables, and comparative experiments select their optimal values so that the dataset retains most of the information while minimizing its size.

In the experimental section, this thesis introduces an attention mechanism on top of the heterogeneous graph convolutional network and the meta-path method, and combines a hyper-graph generation method with the heterogeneous graph convolutional network to propose a new heterogeneous graph convolutional network model. The attention mechanism dynamically balances the weights of the different semantics in the heterogeneous information network, and the hyper-graph generation method addresses the poor connectivity caused by the high proportion of small graphs in the dataset.

To verify the proposed method, two comparative experiments compare the proposed model with other classical graph neural network models. The first uses benchmark datasets such as DBLP and Cora as input to evaluate the proposed model on standard benchmarks. The results show that the heterogeneous graph neural network model performs slightly better than the other graph neural network models on these datasets. The second uses the GitHub dataset as input to evaluate how well the proposed model detects cheating GitHub projects. The results show that the model detects cheating projects effectively and outperforms the other graph neural network models, improving test-set accuracy by about 3%. Finally, the proposed model is applied to additional lists of GitHub projects and yields a list of cheating projects, which demonstrates its practical value.
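The meta-path similarity step can be illustrated with a small sketch. The abstract names neither the three meta-paths nor the similarity measure, so everything specific here is an assumption: the meta-path Repository -> User -> Repository (two repositories starred by the same account), the function name pathsim_repo_user_repo, the toy account and repository names, and PathSim, a published meta-path similarity measure used as a stand-in for the thesis's own definition.

```python
from collections import defaultdict


def pathsim_repo_user_repo(star_edges):
    """PathSim over the hypothetical meta-path Repository -> User -> Repository.

    star_edges is an iterable of (user, repo) pairs, one per Star event.
    Returns {(repo_a, repo_b): score} for pairs sharing at least one user.
    """
    # Collect the set of users that starred each repository.
    users_by_repo = defaultdict(set)
    for user, repo in star_edges:
        users_by_repo[repo].add(user)

    repos = sorted(users_by_repo)
    scores = {}
    for i, a in enumerate(repos):
        for b in repos[i + 1:]:
            shared = len(users_by_repo[a] & users_by_repo[b])
            if shared == 0:
                continue  # no meta-path instance connects a and b
            # With unweighted edges, the number of a->b paths is the number
            # of shared users, and the number of a->a paths is |users(a)|.
            denom = len(users_by_repo[a]) + len(users_by_repo[b])
            scores[(a, b)] = 2 * shared / denom
    return scores


# Toy data (invented for illustration): two bot accounts star the same
# two repositories, while one ordinary account stars repoB and repoC.
edges = [
    ("bot1", "repoA"), ("bot2", "repoA"),
    ("bot1", "repoB"), ("bot2", "repoB"),
    ("alice", "repoB"), ("alice", "repoC"),
]
scores = pathsim_repo_user_repo(edges)
```

In this toy graph the repositories that share both bot accounts receive a higher score than the pair linked only by one ordinary account. A similarity vector in the sense of the abstract would hold one such score per meta-path.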
Keywords/Search Tags: GitHub, promotion-as-a-service, heterogeneous graph convolutional network, meta-path, hyper-graph generation