Mining unstructured software repositories using IR models

Posted on:2014-08-31

Degree:Ph.D.

Type:Thesis

University:Queen's University (Canada)

Candidate:Thomas, Stephen W

GTID:2458390005490869

Subject:Computer Science

Abstract/Summary:

Mining Software Repositories, the process of analyzing data related to software development practices, is an emerging field that aims to aid development teams in their day-to-day tasks. However, data in many software repositories is currently unused because it is unstructured and therefore difficult to mine and analyze. Information Retrieval (IR) techniques, which were developed specifically to handle unstructured data, have recently been used by researchers to mine and analyze the unstructured data in software repositories, with some success.

The main contribution of this thesis is the idea that the research and practice of using IR models to mine unstructured software repositories can be improved by going beyond the current state of affairs. First, we propose new applications of IR models to existing software engineering tasks. Specifically, we present a technique to prioritize test cases based on their IR similarity, giving highest priority to those test cases that are most dissimilar. In another new application of IR models, we empirically recover how developers use their mailing list while developing software.

Next, we show how the use of advanced IR techniques can improve results. Using a framework for combining disparate IR models, we find that bug localization performance can be improved by 14% to 56% on average, compared to the best individual IR model. In addition, by using topic evolution models on the history of source code, we can uncover the evolution of source code concepts with an accuracy of 87% to 89%.

Finally, we show the risks of current research, which uses IR models as black boxes without fully understanding their assumptions and parameters. We show that data duplication in source code has undesirable effects on IR models, and that eliminating the duplication improves their accuracy. Additionally, we find that in the bug localization task, an unwise choice of parameter values results in an accuracy of only 1%, whereas optimal parameter values can achieve an accuracy of 55%.

Through empirical case studies on real-world systems, we show that all of our proposed techniques and methodologies significantly improve the state of the art.
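
The abstract describes test case prioritization only at a high level: rank test cases by IR similarity and run the most dissimilar ones first. The following is a minimal sketch of that idea, not the thesis's actual implementation. It assumes TF-IDF vectors with cosine similarity as the IR model and a greedy farthest-first ordering; the function names (`tokenize`, `tfidf_vectors`, `cosine`, `prioritize`) and the sample test descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic terms; a stand-in for real source-code preprocessing."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF (an assumption)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def prioritize(tests):
    """Order test names so that mutually dissimilar tests come first.

    Greedy farthest-first (assumed strategy): start with the test that is
    least similar to all others on average, then repeatedly pick the test
    whose closest already-chosen neighbour is least similar to it.
    """
    names = list(tests)
    if not names:
        return []
    vecs = tfidf_vectors([tests[name] for name in names])
    k = len(names)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(k)] for i in range(k)]

    remaining = list(range(k))
    first = min(remaining, key=lambda i: sum(sim[i][j] for j in range(k) if j != i))
    order = [first]
    remaining.remove(first)
    while remaining:
        nxt = min(remaining, key=lambda i: max(sim[i][j] for j in order))
        order.append(nxt)
        remaining.remove(nxt)
    return [names[i] for i in order]


# Hypothetical test cases, each represented by the terms in its source text.
tests = {
    "test_login_ok": "login user password session",
    "test_login_error": "login user password error",
    "test_export_csv": "export report csv file",
    "test_export_pdf": "export report pdf file",
}
order = prioritize(tests)
print(order)
```

With these inputs, the two tests scheduled first come from different functional areas (one login test and one export test), because tests that share no terms have zero cosine similarity. Any IR model that yields a pairwise similarity could be substituted for the TF-IDF step.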

Keywords/Search Tags:

IR models, Software, Unstructured, Data, Using

Related items

1	A Study And Application Of The Management Of The Unstructured Data (MUD) Based On XML
2	Design And Implementation Of The Middleware System For Unstructured Textual Big Data
3	Digital Oil Field Of Unstructured Data Management System Design And Implementation
4	Research On The Unstructured Data Ontology And Relevant Algorithms
5	Unstructured Financial Data Management System Design And Implementation
6	Research And Application Of Techniques For Collection And Retrieval On Unstructured Data
7	Based On The HDFS Unstructured Data Retrieval Technology Research And Application
8	Research On High-end Manufacturing Unstructured Data Management
9	Study On The Transformation Technologies Of Unstructured Data In Enterprise Content Management
10	Research On Large-scale Unstructured Data Processing Of Index And Visualization