Research Of Topic Mining In Software Repositories With Applications In Software Maintenance

Posted on:2018-07-28

Degree:Doctor

Type:Dissertation

Country:China

Candidate:M Yan

GTID:1318330536969459

Subject:Software engineering

Abstract/Summary:

Software evolution produces a large amount of development and maintenance data. The most typical kind is textual data, such as source code, commit logs, bug reports, software documentation and mailing lists, which reside in different repositories. Much development experience and knowledge can be mined from these data and used to guide software activities. Topic models, which derive from natural language processing (NLP) and information retrieval, can mine the hidden semantic features of text. How to discover the hidden experience and knowledge in these textual data has attracted wide attention from software engineering researchers. As development technology advances and development modes keep changing, software requirements and complexity grow steadily, and software maintenance becomes increasingly prominent in software development and evolution. In view of this, this thesis concentrates on three kinds of textual data that correlate with software maintenance: source code, commit logs and bug reports. Moreover, it addresses three typical research questions in software maintenance: software change classification, bug triaging and software maintainability assessment. In summary, the content and novelty of the thesis are as follows:

(1) The topic model is the most common technique for mining software textual data. This thesis is motivated by three drawbacks of the original topic model: the difficulty of deciding the number of topics, the non-sparsity of the topic distribution, and the difficulty of introducing label information. Based on the original probabilistic latent semantic analysis (PLSA), this thesis proposes a discriminative PLSA (named DPLSA). In detail, the thesis designs a supervised initialization method that replaces the random initialization of the original PLSA. As a result, the model produces more discriminative topics that capture semantic features better. This part provides the detailed process and formal derivation of DPLSA.

(2) For software change classification, this thesis proposes a DPLSA-based classification model. The model is trained on commit logs with DPLSA, so it automatically learns the probabilistic relationship between words and change categories. This overcomes the difficulty of assigning a subjective weight to each word. In addition, the novelty of the model is that it can classify multi-category software changes and supports cross-project analysis. This part describes the modeling process and an empirical study of software change classification on five projects (Bugzilla, Wireshark, Boost, Firebird and Python), together with the experimental setup, research questions and evaluation measures, and then presents the experimental results and analysis. The results indicate that the proposed model outperforms four state-of-the-art baselines (sLDA, First key, Naïve Bayes and L-LDA).

(3) For software bug component triaging, this thesis focuses on bug reports and combines DPLSA with Jensen-Shannon divergence to propose the DPLSA-JS triaging model. The novelty of the model is that it introduces label information (the component) in the topic modeling step, unlike the state-of-the-art LDA-based methods. As a result, it generates a more discriminative topic representation and improves bug triaging accuracy. This part describes the modeling process and an empirical study of bug component triaging on five projects (Platform, Bugzilla, Mylyn, Gcc and Firefox), together with the experimental setup, research questions and evaluation measures, and then presents the experimental results and analysis. The results indicate that the proposed model outperforms two state-of-the-art baselines (LDA-KL and LDA-SVM).

(4) For software maintainability assessment, this thesis focuses on source code and proposes a probabilistic software maintainability assessment model based on DPLSA. The novelty of the model is that it learns the probabilistic correlation between source code, software metrics and software quality characteristics from a benchmark. This part introduces the modeling process and an empirical study on 10 open source projects, together with the experimental setup, research questions and evaluation measures, and then presents the experimental results and analysis. The results indicate that the proposed model outperforms the AWLE method.

In short, the thesis focuses on the textual data in software repositories and addresses three research questions in software maintenance: software change classification, software bug triaging and software maintainability assessment. To overcome the limitations of existing methods, it proposes new models with improved accuracy, which can provide more accurate suggestions for decision-making in software maintenance and evolution.
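As an illustration of the Jensen-Shannon step named in contribution (3), the following is a minimal sketch, not the thesis's implementation. It assumes that each component is summarized by one topic distribution and that a new bug report is assigned to the component whose distribution has the smallest Jensen-Shannon divergence from the report's own. The abstract does not specify this assignment rule. The function names, component names and all numbers are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 bit."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def triage(report_topics, component_topics):
    """Return the component whose topic distribution is closest to the report.

    Assumption: nearest-distribution assignment; the thesis may use the
    divergence differently.
    """
    return min(
        component_topics,
        key=lambda name: js_divergence(report_topics, component_topics[name]),
    )


# Hypothetical topic distributions over three topics (made up for illustration).
component_topics = {
    "UI": [0.7, 0.2, 0.1],
    "Network": [0.1, 0.8, 0.1],
    "Build": [0.1, 0.1, 0.8],
}
new_report = [0.6, 0.3, 0.1]
best_component = triage(new_report, component_topics)
```

Unlike the Kullback-Leibler divergence used by the LDA-KL baseline, Jensen-Shannon divergence is symmetric and stays finite when one distribution assigns zero probability to a topic, which is convenient when topic distributions are sparse.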

Keywords/Search Tags:

Mining software repositories, Software maintenance, Software change classification, Software bug triaging, Topic model

Related items

1	Software Change Classification Based On Probabilistic Latent Semantic Analysis
2	Research And Implementation Of Defect Change Management Method Of Software Based On Software Process
3	Research On Fundamental Prediction Problems In Software Maintenance
4	Fxoms System Software Maintenance
5	An Investigation Of The Relationship Between Software Bug Severity And Bug Fixing Change Complexity
6	Research About Software Defect Priority Prediction Model Based On AdaBoost-SVM Algorithm
7	Supporting software maintenance by mining software update records
8	Mining Software Repositories For Bug Localization: Comparative Analysis Of Revised Vector Space Model And Pretrained Word Embeddings
9	Research On Metric-based Software Maintenance Process Management
10	Assessing Software Maintainability Based On Class Diagram Design