Open Source Software Tagging Based On Weak Label Learning

Posted on: 2017-10-13  Degree: Master  Type: Thesis
Country: China  Candidate: L Han
GTID: 2428330485466345  Subject: Computer Science and Technology
Abstract/Summary:
Open source software has developed rapidly in recent decades, and open source repositories now host hundreds of thousands of projects. Software engineers often search these repositories for solutions or reusable components. To support retrieval, repositories attach tags to software projects, and in many repositories tags are used to categorize, organize, and search projects. At present, software tagging is a manual process that is expensive and time-consuming. As a result, some projects are missing tags, and some have no tags at all. This thesis studies open source software tagging based on weak label learning and makes the following contributions.

First, the number of distinct tags in open source repositories is very large, so for each tag the projects that carry it are far fewer than the projects that do not. The information provided by positive examples is therefore weak for every tag, and conventional multi-label methods tend to predict no tags for software projects. To address this class imbalance, we propose a cost-sensitive multi-label learning method, ML-CKNN (multi-label cost-sensitive k nearest neighbors). ML-CKNN combines cost sensitivity with multi-label learning, which makes the classifier more sensitive to the positive class. Experiments on three open source repository data sets show that ML-CKNN predicts high-quality tags for software projects and performs significantly better than the compared methods.

Second, in the course of this research we found that a large number of software projects are missing some of their tags. Because the number of distinct tags is very large, the users and managers of open source repositories are not familiar with the full tag set. Conventional multi-label learning methods implicitly assume that the tag set of every instance is complete, so they are not suitable in this scenario. To solve this problem, we propose an automatic software tagging method, TagWell (Tag based on weak label learning), which recovers the missing tags of software projects. TagWell exploits the correlations between software projects and tags to infer missing tags. Experiments on three open source repository data sets show that TagWell can recover missing tags for software projects in open source repositories, and that its tagging performance is significantly better than that of the compared methods.

Last, in open source repositories, software projects with the same features may be assigned different tags, such as synonyms or the same word in a different form. Different tags are therefore interrelated, but the relationship is weak, and open source repositories have so far not been able to take advantage of it. To solve this problem, we use the interaction of multiple information sources to learn the relationships among tags, the relationships among textual descriptions, and the relationships between textual descriptions and tags. This improves the performance of automatic tagging of open source software.
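
The two ideas above, cost-sensitive neighbor voting and recovering missing tags from tag correlations, can be illustrated with a minimal sketch. This is not the ML-CKNN or TagWell algorithm from the thesis, whose details the abstract does not give. The function names, the bag-of-words cosine similarity, the `pos_cost` weight, the co-occurrence score, the `threshold` value, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict_tags(query, projects, k=3, pos_cost=3.0):
    """Cost-sensitive kNN vote (illustrative, not the thesis's ML-CKNN).

    projects is a list of (description, set_of_tags). A tag is predicted
    when pos_cost * (#neighbors with the tag) > (#neighbors without it),
    so pos_cost > 1 makes the vote more sensitive to the rare positive class.
    """
    q = Counter(query.lower().split())
    ranked = sorted(
        projects,
        key=lambda p: cosine(q, Counter(p[0].lower().split())),
        reverse=True,
    )
    neighbors = [tags for _, tags in ranked[:k]]
    all_tags = set().union(*(tags for _, tags in projects))
    predicted = set()
    for tag in all_tags:
        pos = sum(1 for tags in neighbors if tag in tags)
        neg = len(neighbors) - pos
        if pos_cost * pos > neg:
            predicted.add(tag)
    return predicted


def recover_missing_tags(known, tag_sets, threshold=0.8):
    """Add tags that co-occur strongly with a project's known tags.

    A candidate's score is the mean, over the known tags, of the fraction
    of tag sets containing the known tag that also contain the candidate
    (a hypothetical scoring rule, not the thesis's TagWell).
    """
    if not known:
        return set()
    recovered = set(known)
    for cand in set().union(*tag_sets) - set(known):
        ratios = []
        for tag in known:
            with_tag = [s for s in tag_sets if tag in s]
            both = sum(1 for s in with_tag if cand in s)
            ratios.append(both / len(with_tag) if with_tag else 0.0)
        if sum(ratios) / len(ratios) >= threshold:
            recovered.add(cand)
    return recovered


# Toy data (invented for illustration only).
projects = [
    ("python http server library", {"python", "http"}),
    ("python numeric array library", {"python"}),
    ("python plotting library", {"python"}),
    ("java xml parser", {"java", "xml"}),
]
tag_sets = [
    {"django", "python", "web"},
    {"django", "python", "web"},
    {"django", "python"},
    {"python", "math"},
    {"java", "web"},
]
```

With this toy data, `predict_tags("small python http server", projects, pos_cost=1.0)` keeps only `python`, because `http` appears on one of the three nearest neighbors and loses a plain majority vote. Raising `pos_cost` to 3.0 lets that single positive neighbor outweigh the two negatives, so `http` is predicted as well. `recover_missing_tags({"django"}, tag_sets)` adds `python`, which accompanies `django` in every tag set, but not `web`, which falls below the assumed threshold.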
Keywords/Search Tags:Software Mining, Weak-label Learning, Multi-label Learning, Open Source Software Automatic Tagging