Research Of Topic Mining In Software Repositories With Applications In Software Maintenance

Posted on: 2018-07-28, Degree: Doctor, Type: Dissertation
Country: China, Candidate: M Yan, Full Text: PDF
GTID: 1318330536969459, Subject: Software engineering
Abstract/Summary:
Software evolution produces a large amount of development and maintenance data. The most typical kind is textual data, such as source code, commit logs, bug reports, software documents and mailing lists, which reside in different repositories. Much development experience and knowledge can be mined from these data and used to guide different software activities. Topic models, which originate in natural language processing and information retrieval, can mine the hidden semantic features in text. How to discover the hidden experience and knowledge in these textual data has attracted wide attention from software engineering researchers. As software development technology advances and development modes keep changing, software requirements and complexity grow steadily, and software maintenance becomes increasingly important in the development and evolution process. In view of this, this thesis concentrates on three kinds of textual data that correlate with software maintenance: source code, commit logs and bug reports. Moreover, it aims to address three typical research questions in software maintenance: software change classification, bug triaging and software maintainability assessment. In summary, the content and novelty of the thesis are as follows:

(1) The topic model is the most common technique for mining software textual data. This thesis is motivated by the drawbacks of the original topic model: the difficulty of deciding the number of topics, the non-sparsity of the topic distribution, and the difficulty of introducing label information. Based on the original PLSA, this thesis proposes a discriminative PLSA (named DPLSA). In detail, the thesis designs a supervised initialization method that replaces the random initialization method of the original PLSA. As a result, the model produces more discriminative topics that capture the semantic features better. This part provides the detailed process and formal derivation of DPLSA.

(2) For software change classification, this thesis proposes a DPLSA-based software change classification model. The model is trained on commit logs using DPLSA, so it automatically learns the probabilistic relationship between words and change categories. This overcomes the difficulty of assigning a subjective weight to each word. In addition, the novelty of this model is that it can classify multi-category software changes and supports cross-project analysis. This section describes the modeling process and an empirical study of software change classification on five projects (Bugzilla, Wireshark, Boost, Firebird and Python), together with the experimental setup, research questions and evaluation measures, followed by the experimental results and analysis. The results indicate that the proposed model outperforms four state-of-the-art baselines (sLDA, First key, Naïve Bayes and L-LDA).

(3) For software bug component triaging, this thesis focuses on bug reports and combines DPLSA with the Jensen-Shannon divergence to propose the DPLSA-JS triaging model. The novelty of this model is that it introduces the label information (the component) in the topic modeling step, which differs from the state-of-the-art LDA-based methods. As a result, it generates a more discriminative topic representation and improves bug triaging accuracy. This section describes the modeling process and an empirical study of software bug component triaging on five projects (Platform, Bugzilla, Mylyn, Gcc and Firefox), together with the experimental setup, research questions and evaluation measures, followed by the experimental results and analysis. The results indicate that the proposed model outperforms two state-of-the-art baselines (LDA-KL and LDA-SVM).

(4) For software maintainability assessment, this thesis focuses on source code and proposes a probabilistic software maintainability assessment model based on DPLSA. The novelty of this model is that it learns the probabilistic correlation between source code, software metrics and software quality characteristics from a benchmark. This section introduces the modeling process and an empirical study on 10 open source projects, together with the experimental setup, research questions and evaluation measures, followed by the experimental results and analysis. The results indicate that the proposed model outperforms the AWLE method.

In conclusion, the thesis focuses on the textual data in software repositories and addresses three research questions in software maintenance: software change classification, software bug triaging and software maintainability assessment. To overcome the limitations of existing methods, it proposes new models with improved accuracy, which can provide more accurate suggestions for decision-making in software maintenance and evolution.
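As an illustration of the divergence used in part (3), the following minimal sketch computes the Jensen-Shannon divergence between two topic distributions and assigns a bug report to the component whose distribution is closest. It is not the thesis's DPLSA-JS implementation: the function names, the component names and the three-topic distributions are all hypothetical, and the topic distributions are assumed to have been produced already by some topic model.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL divergence to the midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def triage(report_topics, component_topics):
    """Return the component whose topic distribution is nearest to the report."""
    return min(
        component_topics,
        key=lambda name: js_divergence(report_topics, component_topics[name]),
    )


# Hypothetical topic distributions over three topics (assumed, not from the thesis).
components = {
    "UI": [0.7, 0.2, 0.1],
    "Network": [0.1, 0.2, 0.7],
}
report = [0.6, 0.3, 0.1]

print(triage(report, components))  # prints "UI"
```

Unlike the plain KL divergence, the Jensen-Shannon divergence is symmetric and stays finite when one distribution assigns zero probability to a topic, which makes it convenient for comparing sparse topic distributions.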
Keywords/Search Tags:Mining software repositories, Software maintenance, Software change classification, Software bug triaging, Topic model