Similarity Measure For Open Source Projects Based On Heterogeneous Graph Representation Learning

Posted on: 2022-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Zhou
Full Text: PDF
GTID: 2518306479493924
Subject: Software Engineering
Abstract/Summary:
In recent years, open source technology and open source software have profoundly affected all aspects of society, forming the roads and bridges of the modern digital world. The development of open source technology is closely tied to the prosperity of the open source community, and the sustainable development of open source software requires the support of stable contributors. Large numbers of developers and open source software projects form a complex open source ecosystem. With the advent of the big data era, massive amounts of data have accumulated in the field of open source production, and this data is of great significance to the governance of open source communities.

Similarity measurement has important research value in data mining. Measuring the similarity of open source projects is one of the sub-tasks of open source data analysis and community governance. It aims to automatically extract implicit similarity information between projects from data by means of similarity metrics. It also has important application value in tasks such as open source technology classification, open source group portraits, and open source project recommendation. At present, there is no effective method for measuring the similarity of open source projects; related research focuses mainly on code content and software dependency analysis. The open source community ecosystem follows a development model of distributed collaboration and contains significant social semantic information. Based on the historical event data generated during the evolution of open source communities, this work leverages the latent social semantics and proposes an open source project similarity measure based on heterogeneous graph representation learning.

The main contributions of this thesis are as follows:

· Collecting historical data of the open source community. In order to reconstruct the evolution of open source communities, GitHub is taken as the target platform and historical event data is collected from two different data sources. Because existing data sources contain too much redundant data, the historical event data and a supporting infrastructure are then designed to enable efficient real-time online analysis and aggregation.

· Proposing a heterogeneous information network schema and an instance generation method. In order to represent the complex interaction patterns in open source communities, a heterogeneous information network schema is designed to model the event data based on domain knowledge. Meta paths are introduced to capture structural and semantic information between nodes. In order to construct heterogeneous network instances effectively at any scale, a robust method for generating network instances is presented.

· Proposing a method for extracting project similarity information based on heterogeneous graph representation learning. Based on the complex interaction semantics of open source contributors, a weighted random walk sampling method constrained by meta paths is introduced to extract structural and semantic information from the networks in order to measure the similarity between projects.

Experiments are conducted on different network datasets. The results show that the proposed method outperforms the compared methods in open source project clustering and similarity search. Case studies also demonstrate the effectiveness of the open source software project similarity measure based on the semantics of contributor collaboration.
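To make the idea of a weighted random walk constrained by a meta path more concrete, the following is a minimal Python sketch and not the thesis's actual implementation. The toy graph `EDGES`, the `P:`/`D:` type prefixes (project and developer), the edge weights, and the function names are all hypothetical and chosen only for illustration. At each step the walker may only move to a neighbor whose type matches the next type in the meta path, and picks among the allowed neighbors with probability proportional to edge weight.

```python
import random

# Hypothetical toy heterogeneous network: node -> list of (neighbor, weight).
# Node types are encoded by prefix (an assumption for this sketch):
# "P" = project, "D" = developer. Weights could stand for interaction counts.
EDGES = {
    "P:a": [("D:x", 3.0), ("D:y", 1.0)],
    "P:b": [("D:x", 2.0)],
    "P:c": [("D:y", 5.0)],
    "D:x": [("P:a", 3.0), ("P:b", 2.0)],
    "D:y": [("P:a", 1.0), ("P:c", 5.0)],
}


def node_type(node):
    """Return the type prefix of a node id such as 'P:a'."""
    return node.split(":", 1)[0]


def meta_path_walk(edges, start, meta_path, walk_length, rng):
    """Sample one walk whose node types cycle through `meta_path`.

    `meta_path` is the repeating type pattern, e.g. ["P", "D"] for
    project-developer-project-... The next node is drawn from the
    type-compatible neighbors with probability proportional to edge weight.
    """
    if node_type(start) != meta_path[0]:
        raise ValueError("start node type must match the first meta path type")
    walk = [start]
    while len(walk) < walk_length:
        wanted = meta_path[len(walk) % len(meta_path)]
        candidates = [(n, w) for n, w in edges.get(walk[-1], [])
                      if node_type(n) == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        nodes, weights = zip(*candidates)
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


if __name__ == "__main__":
    rng = random.Random(0)
    print(meta_path_walk(EDGES, "P:a", ["P", "D"], 7, rng))
```

In a pipeline of this kind, many such walks would typically be fed to a skip-gram style embedding model, and project similarity would then be computed between the resulting project vectors; that downstream step is an assumption here and is not shown.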
Keywords/Search Tags: Heterogeneous Information Network, Open Source, Similarity Measure, Graph Representation Learning