Mining and modeling the open source software community

Posted on: 2008-01-26
Degree: Ph.D.
Type: Dissertation
University: University of Notre Dame
Candidate: Xu, Jin
Full Text: PDF
GTID: 1448390005968751
Subject: Computer Science
Abstract/Summary:
The success of Open Source Software (OSS) has attracted growing interest across many research areas. Unlike proprietary closed-source software, OSS projects are developed in a distributed and decentralized way. The OSS community is largely composed of part-time developers, who have produced a substantial number of outstanding technical achievements. Studying how OSS developers interact with each other and how projects are developed will help researchers understand the success and failure of OSS projects. OSS developers can also benefit from this research by making more informed decisions about participating in OSS projects.

In this dissertation, we address the challenge of efficiently mining data from OSS web repositories and building models to study OSS community features. Data collection for OSS studies is nontrivial, since most OSS projects are developed by distributed developers using web tools. Most previous studies collect data from OSS web sites with a manually created web crawler, usually written for a specific research goal. We design a mining process that combines web mining and database mining to identify, extract, filter, and analyze data, and we analyze the difficulties of mining OSS data. Our work provides a general solution for researchers to apply advanced techniques, such as web mining, data mining, statistics, and algorithms, to collect and analyze web repository data.

Based on our mining results, we model the OSS community as a social network, which can be further modeled as a project network and a developer network, and study the properties of these networks. Our goal is to find intrinsic mechanisms in OSS networks that explain OSS-specific features such as the roles of developers, communication, and the reliability of the OSS community. We construct four social networks for the OSS development community at SourceForge [59].
Each social network is created by expanding the set of people with different roles in the network, moving from the core project leaders, to the core developers, to the co-developers, and finally out to active users. Social network properties such as degree distribution, diameter, cluster size, and clustering coefficient are calculated and compared for each of the expanding social networks. We elaborate on how the changing topological characteristics of the social networks may signify important capabilities for the diffusion of information, the ability to find collaborations, and the overall robustness of the OSS development community. We further find that all the social networks have scale-free properties, and that the inclusion of the co-developers and active users triggers the emergence of the small-world phenomenon in the social network. We examine how these topological network properties may explain the success and efficiency of OSS development practices.

To study the organization and backbones of the OSS community, we identify the community structure of the SourceForge project network. We find that groups exist in the SourceForge project network. Furthermore, we explore possible reasons for the formation of these groups by examining assortative mixing coefficients for project categories. We find that projects with the same programming languages, operating systems, and topics are more likely to be grouped together. Our research provides useful information for studying the interaction between projects and the communication and information flow in OSS virtual organizations.

We simulate the OSS community based on four social network models (random graphs, preferential attachment, preferential attachment with constant fitness, and preferential attachment with dynamic fitness) using two tools, Repast and Swarm. Our simulation models are fit to data from year two in the history of SourceForge.
To prove the correctness of our simulations, docking experi...
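
The network properties named in the abstract (degree distribution and clustering coefficient on a developer network) can be illustrated with a minimal Python sketch. This is not the dissertation's code: the membership data, the rule that two developers are linked when they share a project, and all function names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical membership data (project name -> developer ids), not SourceForge data.
MEMBERSHIP = {
    "proj_a": {"dev1", "dev2", "dev3"},
    "proj_b": {"dev3", "dev4"},
    "proj_c": {"dev4", "dev5"},
}


def developer_network(membership):
    """Assumed linking rule: two developers are connected if they share a project."""
    adj = {}
    for devs in membership.values():
        for dev in devs:
            adj.setdefault(dev, set())
        for a, b in combinations(sorted(devs), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_distribution(adj):
    """Map each degree value to the number of nodes that have it."""
    return Counter(len(neighbors) for neighbors in adj.values())


def clustering_coefficient(adj, node):
    """Fraction of the node's neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbors; reported as 0 here
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the per-node clustering coefficients."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


if __name__ == "__main__":
    network = developer_network(MEMBERSHIP)
    print(dict(degree_distribution(network)))
    print(average_clustering(network))
```

In this toy network, dev3 bridges two projects, so only one of its three neighbor pairs is connected. The same functions could be applied to each of the four expanding networks to compare how the metrics change as more roles are included.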
Keywords/Search Tags: OSS, Community, Mining, Software, Data, Network, Projects are developed, Web