Research On Open-Source Npm Packages Classification Based On Text Analysis

Posted on:2022-11-08

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

GTID:2518306758991689

Subject:Automation Technology

Abstract/Summary:

As the package manager for the JavaScript language, npm (Node Package Manager) manages more than 2.5 million open-source third-party libraries, known as npm packages. Because no suitable classification method exists for npm packages, this massive body of software is difficult to manage and search. In the open-source community where developers share npm packages, custom tags describe a package's functions and also serve as a rough classification. However, the current tagging mechanism suffers from mixed content and inconsistent wording of synonyms, so it cannot meet the needs of management and retrieval. In addition, more than 40% of packages in the npm community lack tags, and there are too many of them to tag manually. For these reasons, this paper addresses two problems: constructing category tags for the npm community, and automating the multi-label classification of npm packages.

(1) We propose a method for building category tags based on the associations between tags, which aims to build function-oriented taxonomy categories for npm packages. First, the method uses an association rule mining algorithm to generate a tag association graph for the tags in the npm community. Second, a community detection algorithm clusters the tags according to these associations, forming several tag communities that each represent an independent function. Finally, the tag communities are filtered and merged manually, and we design a category tag recognition mechanism based on the influence of the tags within each community. Applied to the 8000 most depended-upon packages in the npm community, the method yields 35 representative category tags.

(2) We propose a multi-label text classification method for Readme documents to automate the classification of npm packages. The method first formulates a content segmentation scheme that extracts functional description information according to the content structure of the Readme file. It then uses a weighted keyword set to capture the semantic association between the classification information in the Readme document and the category tags, which makes the method more accurate than traditional multi-label text classification methods. Specifically, the method first constructs a keyword set for each category tag from the topic-word distribution generated during training of the supervised topic model L-LDA. Then, the Word Mover's Distance algorithm is used to calculate the semantic similarity between the Readme document of the package to be classified and the keyword set of each category tag. Finally, the method assigns category tags to the package according to the ranking of these similarities.

Experiments show that the proposed multi-label text classification method for Readme documents can effectively classify npm packages by functionality. Compared with baseline multi-label classification methods, it greatly improves on three multi-label evaluation metrics, Macro-F1, Hamming Loss, and LRAP, which confirms its classification accuracy. The method also performs well when classifying real untagged packages, which confirms its effectiveness. Furthermore, this paper builds a representative dataset for research on open-source npm package classification.
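
The tag-association step in contribution (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the thresholds `min_support` and `min_confidence`, the function names, and the sample tag lists are all hypothetical, pairwise co-occurrence counting stands in for a full association rule mining algorithm, and connected components stand in for a real community detection algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_tag_graph(package_tags, min_support=2, min_confidence=0.5):
    """Return an undirected tag graph as {tag: set(neighbour tags)}.

    An edge a-b is kept when the pair co-occurs in at least `min_support`
    packages and either rule a -> b or b -> a reaches `min_confidence`.
    Both thresholds are illustrative assumptions.
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in package_tags:
        unique = sorted(set(tags))
        tag_count.update(unique)
        pair_count.update(combinations(unique, 2))

    graph = defaultdict(set)
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue
        # Confidence of a rule x -> y is support(x, y) / count(x).
        confidence = max(support / tag_count[a], support / tag_count[b])
        if confidence >= min_confidence:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def tag_groups(graph):
    """Group tags by connected component.

    This is a crude stand-in for community detection, used only to keep
    the sketch dependency-free.
    """
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical tag lists, one per package.
packages = [
    ["http", "request", "client"],
    ["http", "request"],
    ["http", "client"],
    ["test", "mocha"],
    ["test", "mocha", "assert"],
    ["http", "test"],
]
graph = build_tag_graph(packages)
groups = tag_groups(graph)
```

On this toy data the sketch produces two groups, one HTTP-related and one testing-related. Rare tags such as `assert` fall below the support threshold and drop out of the graph. In the method described above, such groups would then be filtered and merged manually before a category tag is chosen for each.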

Keywords/Search Tags:

npm, open-source software classification, multi-label classification, text analysis

Related items

1	Research On Text Multi-label Classification Algorithm Based On Label Correlation
2	Research And Implementation On Text Classification In Vertical Domain
3	Research On Multi-label Classification For Scientific Text Resources Based On Deep Learning
4	Research On Web Text Mining Based For Multi-instance Multi-label Classification
5	Research And Design Of Classification Algorithm Based On Massive Multi-label Text
6	Software Bug Triaging Based On Text Classification And Developer Rating
7	Research On Multi-Label Text Classification Based On Deep Learning
8	On Multi-label Text Classification Algorithms Based On Deep Learning
9	Research On The Essential Technology Of Multi-Label Chinese Text Classification
10	Research On Multi-label Classification Algorithms Based On Samples And Property Analysis