Configuring and Assembling Information Retrieval based Solutions for Software Engineering Tasks

Posted on: 2016-05-24

Degree: Ph.D.

Type: Dissertation

University: The College of William and Mary

Candidate: Dit, Bogdan

Full Text: PDF

GTID: 1478390017976647

Subject: Computer Science

Abstract/Summary:

Textual or unstructured data generated during the software development process contains a significant amount of useful information that captures design decisions and the rationale of developers. One way to exploit this information in support of software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis) is to use Information Retrieval (IR) techniques (e.g., Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation). Two of the most important steps in a typical process of applying IR techniques to SE tasks are: (i) preprocessing the corpus (i.e., a set of documents associated with a software system) by removing special characters, splitting identifiers, removing stop words, stemming identifiers, etc., and (ii) configuring the IR technique (i.e., setting up its parameters) and applying it to the preprocessed corpus.

In our previous work, we observed that the options available for the corpus preprocessing steps (e.g., splitting identifiers), as well as the parameter values chosen when configuring IR techniques (e.g., the parameters of LDA), can significantly influence the results produced by IR techniques on different datasets for various SE tasks.

This dissertation proposes the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process to support software engineering tasks. The approach, named IR-GA, determines the (near) optimal solution to be used for each step of the IR process. For example, for the corpus preprocessing steps, IR-GA determines which special characters to remove, chooses the method to split identifiers, and decides whether to remove stop words and how to stem identifiers. In addition, for the chosen IR technique, it automatically determines (near) optimal parameter values.

In an extensive empirical study, we applied IR-GA to three different software engineering tasks: (i) traceability link recovery, (ii) feature location, and (iii) identification of duplicate bug reports. The results of the study indicate that IR-GA outperforms approaches previously used in the literature, and that it does not significantly differ from an ideal upper bound that could be achieved by a supervised approach (i.e., one that knows the results a priori) and a combinatorial approach (i.e., one that considers a large number of parameter combinations and knows the results beforehand).
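
The idea of searching over preprocessing options and parameter values with a genetic algorithm can be illustrated with a small Python sketch. It is not the dissertation's implementation: the abstract does not give the encoding, the genetic operators, or the fitness function, so everything below is an assumption made for illustration. The names `SEARCH_SPACE`, `preprocess`, `fitness`, and `run_ga` are hypothetical, the stemmer is a naive suffix stripper, Jaccard overlap stands in for a real IR technique such as VSM, LSI, or LDA, and the two toy queries exist only to give the search something to score.

```python
import random
import re

# Hypothetical search space: one gene per configurable step of the IR process.
SEARCH_SPACE = {
    "split_identifiers": [False, True],
    "remove_stop_words": [False, True],
    "stem": [False, True],
    "min_token_length": [1, 2, 3, 4],
}
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is"}

DOCS = ["void saveUserName(String n)", "int parseConfigFile(Path p)", "void drawWindow()"]
# Toy data: (query, index of the relevant document in DOCS).
TOY_QUERIES = [("error when saving the user name", 0), ("config file is not parsed", 1)]

def naive_stem(token):
    # Stand-in for a real stemmer: strip one common suffix, keep at least 3 letters.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, config):
    tokens = re.findall(r"[A-Za-z]+", text)  # drops special characters and digits
    if config["split_identifiers"]:
        # Split camelCase identifiers, keeping acronyms such as "XML" together.
        tokens = [p for t in tokens for p in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", t)]
    tokens = [t.lower() for t in tokens]
    if config["remove_stop_words"]:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if config["stem"]:
        tokens = [naive_stem(t) for t in tokens]
    return {t for t in tokens if len(t) >= config["min_token_length"]}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fitness(config):
    # Assumed fitness: mean reciprocal rank of the relevant document (ties count against it).
    total = 0.0
    for query, relevant in TOY_QUERIES:
        q = preprocess(query, config)
        scores = [jaccard(q, preprocess(d, config)) for d in DOCS]
        rank = 1 + sum(1 for i, s in enumerate(scores) if i != relevant and s >= scores[relevant])
        total += 1.0 / rank
    return total / len(TOY_QUERIES)

def crossover(a, b, rng):
    # Uniform crossover: each gene comes from one of the two parents.
    return {k: rng.choice((a[k], b[k])) for k in SEARCH_SPACE}

def mutate(config, rng, rate=0.2):
    # Resample each gene from its allowed values with probability `rate`.
    return {k: rng.choice(SEARCH_SPACE[k]) if rng.random() < rate else v
            for k, v in config.items()}

def run_ga(generations=10, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [{k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
                  for _ in range(population_size)]
    for _ in range(generations):
        elite = sorted(population, key=fitness, reverse=True)[:2]  # elitism
        children = []
        while len(children) < population_size - len(elite):
            # Binary tournament selection for each parent.
            a = max(rng.sample(population, 2), key=fitness)
            b = max(rng.sample(population, 2), key=fitness)
            children.append(mutate(crossover(a, b, rng), rng))
        population = elite + children
    best = max(population, key=fitness)
    return best, fitness(best)
```

Calling `run_ga()` returns the best configuration found and its score on the toy queries. In a realistic setting the fitness function would have to be something that can be computed for the task at hand, and each evaluation would involve running the chosen IR technique on the full preprocessed corpus, which is considerably more expensive than the set overlap used here.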

Keywords/Search Tags:

Software, Information, Tasks, IR techniques, Configuring, Approach, Results, Process

Related items

1	Research And Realization Of Procedure-oriented Configurable Quality Management Information System
2	Scheduling design processes with interdependent tasks: A systems analysis approach
3	Task-design Of Task-driven Approach In Vocational Teaching Of Computer
4	Inferring Semantic Information from Natural-Language Software Artifacts
5	Results and Techniques in Multiuser Information Theory
6	Design And Implementation Of Software Testing Standardization Process Management System For RSA Timing Attack Tasks
7	Data-Driven Software Development Process Mining And Analysis
8	An Agent-based Approach For Software Process Modeling
9	Efficient techniques for partitioning software development tasks
10	Integrated Structural Process Model: An inclusive non-material specific approach to determining the required tasks and information exchanges for structural building information modeling