Configuring and Assembling Information Retrieval based Solutions for Software Engineering Tasks

Posted on: 2016-05-24
Degree: Ph.D
Type: Dissertation
University: The College of William and Mary
Candidate: Dit, Bogdan
Full Text: PDF
GTID: 1478390017976647
Subject: Computer Science
Abstract/Summary:
Textual or unstructured data generated during the software development process contains a significant amount of useful information that captures design decisions and the rationale of developers. One way to exploit this information in support of various software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis) is to use Information Retrieval (IR) techniques (e.g., Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation). Two of the most important steps in a typical process of applying IR techniques to SE tasks are: (i) preprocessing the corpus (i.e., a set of documents associated with a software system) by removing special characters, splitting identifiers, removing stop words, stemming identifiers, etc., and (ii) configuring the IR technique (i.e., setting its parameters) and applying it to the preprocessed corpus.
In our previous work, we observed that the options available for the corpus preprocessing steps (e.g., how identifiers are split), as well as the parameter values used to configure IR techniques (e.g., the parameters of LDA), can significantly influence the results produced by IR techniques on different datasets for various SE tasks.
This dissertation proposes the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process to support software engineering tasks. The approach, named IR-GA, determines the (near) optimal solution to be used for each step of the IR process. For example, for the corpus preprocessing steps, IR-GA determines which special characters to remove, chooses the method for splitting identifiers, and decides whether to remove stop words and how to stem identifiers. In addition, for the chosen IR technique, it automatically determines the (near) optimal parameter values.
In an extensive empirical study, we applied IR-GA to three different software engineering tasks: (i) traceability link recovery, (ii) feature location, and (iii) identification of duplicate bug reports. The results of the study indicate that IR-GA outperforms approaches previously used in the literature, and that it does not significantly differ from an ideal upper bound that could be achieved by a supervised approach (i.e., one that knows the results a priori) and a combinatorial approach (i.e., one that considers a large number of parameter combinations and knows the results beforehand).
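The idea of letting a genetic algorithm choose the preprocessing options and IR settings can be illustrated with a small sketch. The code below is not the dissertation's implementation. It is a minimal example that assumes a four-bit genome (split identifiers, remove stop words, stem, use IDF weighting), a toy corpus, a crude suffix-stripping stemmer, and a made-up unsupervised fitness (how far the best-matching document stands out from the average). The abstract does not specify any of these details, and every name here (`DOCS`, `QUERIES`, `decode`, `fitness`, `evolve`) is hypothetical.

```python
import math
import random
import re
from collections import Counter

# Toy corpus and queries, invented for illustration only.
DOCS = [
    "def parseConfigFile(path): reads the config file and returns settings",
    "class UserLoginHandler: handles login requests for users",
    "def compute_tf_idf(tokens): computing weights of the indexed terms",
]
QUERIES = ["login fails for user", "config file is not parsed"]
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "for", "not"}

def split_identifier(token):
    # Split camelCase and snake_case identifiers into lowercase words.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return spaced.lower().split()

def decode(genome):
    # Map a bit tuple to named options (an assumed encoding).
    return dict(zip(("split", "stop", "stem", "idf"), map(bool, genome)))

def preprocess(text, config):
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    if config["split"]:
        tokens = [w for t in tokens for w in split_identifier(t)]
    else:
        tokens = [t.lower() for t in tokens]
    if config["stop"]:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if config["stem"]:
        # Crude suffix stripping, a stand-in for a real stemmer.
        tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

def vectorize(token_lists, use_idf):
    # Term-frequency vectors, optionally scaled by a smoothed IDF.
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append({
            t: c * (math.log((1 + n) / (1 + df[t])) + 1 if use_idf else 1.0)
            for t, c in tf.items()
        })
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def fitness(genome):
    # Assumed proxy: mean gap between the top similarity and the average one.
    config = decode(genome)
    docs = [preprocess(d, config) for d in DOCS]
    queries = [preprocess(q, config) for q in QUERIES]
    vecs = vectorize(docs + queries, config["idf"])
    doc_vecs, query_vecs = vecs[:len(docs)], vecs[len(docs):]
    gaps = []
    for q in query_vecs:
        sims = [cosine(q, d) for d in doc_vecs]
        gaps.append(max(sims) - sum(sims) / len(sims))
    return sum(gaps) / len(gaps)

def evolve(fitness_fn, n_genes=4, pop_size=6, generations=10, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_genes)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness_fn, reverse=True)
        next_pop = ranked[:2]                      # elitism: keep the two best
        while len(next_pop) < pop_size:
            a, b = rng.sample(ranked[:4], 2)       # truncation selection
            cut = rng.randrange(1, n_genes)        # one-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.2:                 # occasional bit-flip mutation
                i = rng.randrange(n_genes)
                child[i] = 1 - child[i]
            next_pop.append(tuple(child))
        pop = next_pop
    return max(pop, key=fitness_fn)

if __name__ == "__main__":
    best = evolve(fitness)
    print(decode(best), round(fitness(best), 3))
```

With only sixteen possible configurations, an exhaustive search would be trivial here. The sketch is meant only to show how preprocessing choices and an IR parameter can be encoded as genes and evaluated together. A realistic search space (multiple splitters and stemmers, numeric parameters such as the number of LDA topics) is where a GA becomes useful.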
Keywords/Search Tags:Software, Information, Tasks, IR techniques, Configuring, Approach, Results, Process