
Research On Open-Source Npm Packages Classification Based On Text Analysis

Posted on: 2022-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
GTID: 2518306758991689
Subject: Automation Technology
Abstract/Summary:
As the package manager for the JavaScript language, npm (Node Package Manager) manages more than 2.5 million open-source third-party libraries, known as npm packages. Because no suitable classification method exists for npm packages, this massive body of software resources is difficult to manage and search. In the open-source community where developers share and discuss npm packages, custom tags describe a package's functions and also serve, to some extent, as a classification. However, the current tagging mechanism suffers from mixed content and from synonyms expressed in different ways, so it cannot meet the needs of management and retrieval. Moreover, more than 40% of packages in the npm community have no tags, and there are too many of them to tag manually. For these reasons, this paper focuses on two problems: constructing category tags for the npm community, and automated multi-label classification of npm packages.

(1) We propose a method for building category tags based on the associations between tags, with the aim of building function-oriented taxonomy categories for npm packages. First, the method uses an association rule mining algorithm to generate a tag association graph for the tags in the npm community. Second, a community detection algorithm clusters the tags according to these associations, forming several tag communities that each represent an independent function. Finally, the tag communities are filtered and merged manually, and a category tag recognition mechanism selects the category tag according to the influence of the tags within each community. Applied to the 8,000 most depended-upon packages in the npm community, this method yields 35 representative category tags.

(2) We propose a multi-label text classification method for Readme documents to automate the classification of npm packages. The method first defines a content segmentation scheme that extracts functional description information according to the content structure of the Readme file. It then uses a weighted keyword set to capture the semantic association between the classification information in the Readme document and the category tags, which gives it better classification accuracy than traditional multi-label text classification methods. Specifically, the method constructs a keyword set for each category tag from the topic-word distribution generated while training the supervised topic model L-LDA. It then uses the Word Mover's Distance algorithm to calculate the semantic similarity between the Readme document of the package to be classified and the keyword set of each category tag. Finally, it assigns category tags to the package according to the similarity ranking.

Experiments show that the proposed multi-label text classification method for Readme documents can effectively classify npm packages by functionality. Compared with baseline multi-label classification methods, it achieves substantially better results on three multi-label evaluation metrics, Macro-F1, Hamming Loss, and LRAP, which confirms its classification accuracy. The method also performs well when classifying real untagged packages, which confirms its effectiveness in practice. Furthermore, this paper builds a representative dataset for research on open-source npm package classification.
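The category-tag construction in part (1) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not name the association rule miner, the community detection algorithm, or the influence measure, so this sketch assumes pairwise support and confidence thresholds for the rules, uses connected components as a simple stand-in for community detection, and treats the number of links a tag has as a proxy for its influence. The function names, thresholds, and sample tag lists are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_tag_graph(packages, min_support=2, min_confidence=0.5):
    """Link two tags when they co-occur often enough (pairwise association rules).

    The thresholds are illustrative assumptions, not values from the thesis.
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in packages:
        unique = sorted(set(tags))
        tag_count.update(unique)
        pair_count.update(combinations(unique, 2))

    graph = defaultdict(set)
    for (a, b), n in pair_count.items():
        # Confidence of the rule a -> b is n / count(a); keep the edge
        # if the rule is strong enough in either direction.
        confidence = max(n / tag_count[a], n / tag_count[b])
        if n >= min_support and confidence >= min_confidence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def tag_communities(graph):
    """Stand-in for community detection: connected components of the tag graph."""
    seen = set()
    communities = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(component)
    return communities


def representative_tag(community, graph):
    """Pick the tag with the most links, as an assumed proxy for 'influence'."""
    return max(sorted(community), key=lambda tag: len(graph[tag]))


# Hypothetical tag lists, one per package.
packages = [
    ["test", "mocha", "assert"],
    ["test", "mocha"],
    ["test", "assert"],
    ["http", "request"],
    ["http", "request", "client"],
    ["http", "client"],
    ["test", "http"],
]

graph = build_tag_graph(packages)
communities = tag_communities(graph)
for community in communities:
    print(sorted(community), "->", representative_tag(community, graph))
```

In this toy data the single co-occurrence of `test` and `http` falls below the support threshold, so the testing tags and the HTTP tags stay in separate communities. A real community detection algorithm would also split densely connected groups that share a few weak links, which connected components cannot do.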
Keywords/Search Tags: npm, open-source software classification, multi-label classification, text analysis