
Research Of Duplicate Bug Reports Detection Based On LDA

Posted on: 2014-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Jiang
Full Text: PDF
GTID: 2268330392472153
Subject: Software engineering
Abstract/Summary:
A bug report is a description of a defect, produced during the software maintenance cycle. Because these reports are typically written in a hurry by users who understand little about the software itself, they are often vague, unprofessional, incomplete, and difficult to understand, and the same defect is often submitted more than once. The result is a large number of redundant duplicate bug reports. If such duplicates are repeatedly assigned to developers, human resources are seriously wasted, and the problem is particularly evident in large-scale open source projects.

To reduce the burden of detecting duplicate bug reports manually, many researchers in China and abroad have studied the problem and proposed a series of detection methods. However, traditional automatic detection methods commonly use the vector space model as their theoretical foundation, and the vectors it builds are high-dimensional, sparse, and noisy. These problems reduce detection efficiency and lead to low recall and precision. To solve them, this thesis proposes a new method based on topic model theory. Latent Dirichlet Allocation (LDA) is the simplest topic model. With the LDA model, bug reports can be mapped from the traditional high-dimensional word space into a low-dimensional topic space, and the similarity between documents is then calculated in that topic space. This greatly reduces the dimension of the space to be processed and improves the efficiency of the detection algorithm.

The main work is as follows:
1. Through an extensive review of the relevant literature, we analyze the research background and the state of research in China and abroad, and identify the current problems in the field and the corresponding solutions.
2. Through analysis of the distribution of bug reports, we build the sample space for the experiment, extract the required bug report data from it, and then preprocess the experimental data. Preprocessing consists mainly of two steps, database cleanup and deep data cleanup, which ensure the validity and reliability of the experimental data.
3. We reproduce the traditional method for duplicate bug report detection as a comparison experiment. We first introduce vector space model theory, then analyze the feature item selection and weight calculation approach, and finally use the vector space model to calculate the similarity of bug reports and evaluate the experimental results.
4. To address the drawbacks of the traditional method, we carry out the duplicate bug report detection experiment based on Latent Dirichlet Allocation. Firstly, the experiment uses the LDA algorithm to build a topic model; secondly, it constructs a test sample space to make the results easier to verify; thirdly, it calculates the execution information similarity and the classified information similarity separately; fourthly, it takes a weighted sum of the two similarities to obtain the final similarity of bug reports; finally, the results are evaluated.

Experimental results show that duplicate bug report detection based on Latent Dirichlet Allocation can overcome many shortcomings of the traditional method, such as high dimensionality and high noise, while adding the execution information and classified information can greatly improve the accuracy of the results.
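The similarity calculation in step 4 can be illustrated with a minimal Python sketch. The abstract does not say how each component similarity is computed or which weights are used, so everything concrete here is an assumption: each report is represented by two hypothetical topic-proportion vectors (as an LDA model might produce for its execution information and its classified information), both components use cosine similarity, and the two weights sum to 1. The names `cosine`, `weighted_similarity`, `report_similarity`, and `rank_candidates` are illustrative and not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def weighted_similarity(sim_execution, sim_classified, weight=0.5):
    """Weighted sum of the two component similarities.

    `weight` is a hypothetical tuning parameter; the abstract does not
    state the weights the thesis actually uses.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * sim_execution + (1.0 - weight) * sim_classified


def report_similarity(a, b, weight=0.5):
    """Final similarity of two reports given their topic-space vectors."""
    return weighted_similarity(
        cosine(a["execution"], b["execution"]),
        cosine(a["classified"], b["classified"]),
        weight,
    )


def rank_candidates(query, candidates, weight=0.5):
    """Return (report_id, score) pairs, most similar candidate first."""
    scored = [
        (report_id, report_similarity(query, report, weight))
        for report_id, report in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up topic proportions over 4 topics, standing in for LDA output.
query = {"execution": [0.70, 0.20, 0.05, 0.05],
         "classified": [0.10, 0.80, 0.05, 0.05]}
candidates = {
    "dup": {"execution": [0.70, 0.20, 0.05, 0.05],
            "classified": [0.10, 0.80, 0.05, 0.05]},
    "other": {"execution": [0.05, 0.05, 0.20, 0.70],
              "classified": [0.80, 0.10, 0.05, 0.05]},
}

for report_id, score in rank_candidates(query, candidates, weight=0.6):
    print(f"{report_id}: {score:.3f}")
```

Each vector here has only as many entries as there are topics, rather than one entry per vocabulary word as in the vector space model, which is the dimensionality reduction the abstract describes. In practice the topic vectors would come from a trained LDA model rather than being written by hand.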
Keywords/Search Tags: duplicate bug reports detection, Latent Dirichlet Allocation, Vector Space Model