Mining Software Repositories For Bug Localization: Comparative Analysis Of Revised Vector Space Model And Pretrained Word Embeddings

Posted on: 2021-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: VESNA BO?OVI?WN
GTID: 2428330611999877
Subject: Software engineering
Abstract/Summary:
The field of mining software repositories analyzes the data in software repositories in order to facilitate software development. Despite the plethora of data in version control systems, bug tracking systems, communication archives, design requirements, and documentation, researchers face challenges when using it for analysis because of its highly unstructured nature. One of the tasks that mining software repositories practitioners try to solve is bug localization. Bugs in source code can be difficult to localize: manual bug localization is notoriously tedious, and developers spend a lot of time on it. Bug localization aims to automatically identify buggy source code files based on bug reports. Even though there are plenty of automated techniques, this area has not yet reached its full potential or commercialization. Automated bug localization therefore remains an open question in which research communities have shown great interest.

With recent developments in natural language processing, many models have been proposed for embedding words in vectors. They are based on the distributional hypothesis: the proximity of word meanings is represented by the proximity of the words in a vector space. Such models allow us to measure semantic similarity between words by looking at the distances between their vector representations. This thesis explores the efficiency of pretrained word embedding models in combination with an information retrieval model for the task of bug localization. Our model consists of two components: a revised vector space model and pretrained word embeddings. We combine the individual rankings by minimizing an objective function, and we define the final ranking as a weighted sum of the scores obtained from the two components.

The dataset consists of bug reports retrieved from Bugzilla and source code written in Java. Java code makes it possible to use an abstract syntax tree parser to retrieve only a subset of its structure fields. We perform two sets of experiments: in the first, we use the whole content of each source code file to perform the localization; in the second, we parse the abstract syntax trees of the source code and extract only several structure fields, such as class names, variable names, and method names. Bug localization deals with unstructured data, such as bug reports, source code comments, and identifiers, and the preprocessing applied to source code files and bug reports has a large impact on the ranking results. Using different preprocessing techniques, we evaluate the proposed model on its ability to retrieve a ranked list of source code files for a given bug report. Our key insight is that extracting structure fields from source code files by parsing their abstract syntax trees makes better bug localization possible.
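
The weighted-sum combination described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the min-max normalization step, and the fixed weight `alpha` are assumptions made for illustration, whereas the thesis obtains its weights by minimizing an objective function.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (s - lo) / (hi - lo) for name, s in scores.items()}


def combine_rankings(
    rvsm_scores: Dict[str, float],
    embedding_scores: Dict[str, float],
    alpha: float,
) -> List[Tuple[str, float]]:
    """Rank files by alpha * rVSM score + (1 - alpha) * embedding score.

    Both dictionaries are assumed to hold one score per source file for a
    single bug report, with identical keys. `alpha` is a hypothetical
    hand-picked weight, not a value learned as in the thesis.
    """
    rvsm_n = min_max_normalize(rvsm_scores)
    emb_n = min_max_normalize(embedding_scores)
    combined = {
        name: alpha * rvsm_n[name] + (1.0 - alpha) * emb_n[name]
        for name in rvsm_n
    }
    # Highest combined score first: the most suspicious file leads the list.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for three files, for illustration only.
    rvsm = {"A.java": 0.9, "B.java": 0.1, "C.java": 0.5}
    emb = {"A.java": 0.2, "B.java": 0.8, "C.java": 0.5}
    for name, score in combine_rankings(rvsm, emb, alpha=0.7):
        print(f"{name}\t{score:.2f}")
```

With a weight that favors the vector space component, the file it scores highest leads the ranking. Lowering `alpha` shifts the ranking toward the files that the embedding component considers most similar to the bug report.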
Keywords/Search Tags:mining software repositories, bug localization, information retrieval, vector space models