
Research And Implementation Of Multi-label Text Classification Based On User Generated Content

Posted on: 2019-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: J H Liu
GTID: 2348330545458407
Subject: Computer technology
Abstract/Summary:
With the development of Web 2.0 technology, the user-generated content model has become the fastest-growing mode of resource creation and sharing. This mode produces a large amount of multi-label text data, which has a wide range of applications in information retrieval and data mining. How to classify these multi-label texts automatically is therefore a valuable research topic.

In traditional binary and multi-class classification problems, each instance is assigned exactly one label from a finite set of labels; these are called single-label classification problems. In many real-world applications, however, an instance is assigned more than one label simultaneously; these are called multi-label classification problems. Multi-label learning is extremely challenging because a multi-label classifier has to predict a varying number of outputs for each sample. The correlation and co-occurrence between labels also make multi-label classification quite different from traditional single-label classification. Multi-label classification has become a hot research topic in data mining, and its results are widely used in fields such as the semantic tagging of images and video, functional genomics, and musical emotion classification.

In the context of user-generated content, not only is the content updated at a rapid rate, but the number of labels is also constantly changing. Traditional multi-label classification algorithms cannot adapt well to such scenarios. At the same time, the proliferation of labels raises issues such as how to select features efficiently. Based on the above, the main research work of this thesis is as follows:

1) MLFSIG, a multi-label feature selection method based on information gain, is proposed. Under the feature-independence assumption, the method measures the importance of each feature by its information gain with respect to the label set. Optimizing the information gain calculation greatly reduces the time complexity of the algorithm. Experiments on multiple datasets verify the effectiveness of the algorithm.

2) ML-RWR, a multi-label classification method based on random walk with restart, is proposed. The method maps multi-label data to the vertices of a graph and constructs different random walk graphs through different connection methods. A new sample is added to the random walk graph as the starting point of a random walk, and the result of the walk determines the sample's probability distribution over the labels. Two kinds of random walk graphs are constructed: one based on KNN connections and one based on label center point connections. On the graph with label center point connections, a self-adaptive multi-label classification method is proposed. Compared with traditional multi-label classification methods, it adapts more quickly to scenarios in which the labels change. Experiments on multiple datasets, comparing against multiple algorithms on multiple evaluation metrics, verify the effectiveness of the algorithm.

3) A multi-label text classification prototype system is designed and implemented. Based on the Django framework and the MVC pattern, the system can be used to quickly build multi-label text classifiers and supports data processing, model training, data prediction, and other functions.
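
The information-gain feature scoring described above can be illustrated with a short sketch. This is not the MLFSIG algorithm itself, whose exact formula and optimizations are not given in the abstract. It assumes binary features and binary label indicators, and it assumes that a feature's score is the sum of its information gain over the individual labels. The function names `entropy` and `information_gain_scores` are hypothetical.

```python
import numpy as np


def entropy(p):
    """Binary entropy in bits of a Bernoulli probability (array-friendly)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))


def information_gain_scores(X, Y):
    """Score each feature by its summed information gain over all labels.

    X: (n_samples, n_features) binary feature matrix.
    Y: (n_samples, n_labels) binary label-indicator matrix.
    Assumption (not from the source): score = sum over labels of
    IG(label; feature), with each feature treated independently.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]

    h_label = entropy(Y.mean(axis=0))        # H(label), shape (n_labels,)
    n_on = X.sum(axis=0)                     # samples where the feature is 1
    n_off = n - n_on                         # samples where the feature is 0

    # P(label = 1 | feature = 1) and P(label = 1 | feature = 0)
    p_on = (X.T @ Y) / np.maximum(n_on, 1)[:, None]
    p_off = ((1 - X).T @ Y) / np.maximum(n_off, 1)[:, None]

    # Conditional entropy H(label | feature), shape (n_features, n_labels)
    h_cond = (n_on / n)[:, None] * entropy(p_on) \
        + (n_off / n)[:, None] * entropy(p_off)

    return (h_label[None, :] - h_cond).sum(axis=1)


# Toy data: feature 0 matches the label exactly, feature 1 is unrelated to it.
X_demo = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
Y_demo = np.array([[1], [1], [0], [0]])
scores = information_gain_scores(X_demo, Y_demo)
ranking = np.argsort(-scores)  # feature indices, most informative first
```

Features would then be kept in order of `ranking`. Because every count comes from two matrix products, the cost grows with the number of nonzero feature-label co-occurrences, which is one plausible way to keep such a computation cheap.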
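
The random-walk classification idea can be sketched in the same spirit. The code below is a generic random walk with restart on a KNN graph, not the ML-RWR method itself. The graph construction, the restart probability, the way visiting probabilities are turned into label scores, and the name `rwr_label_scores` are all assumptions made for illustration. The label center point variant is not shown.

```python
import numpy as np


def rwr_label_scores(X_train, Y_train, x_new, k=2, restart=0.15, iters=100):
    """Score labels for x_new with a random walk with restart on a KNN graph.

    Assumptions (not from the source): Euclidean KNN edges, a symmetrized
    unweighted graph, and label scores equal to the label-weighted average
    of the walk's visiting probabilities over the training vertices.
    Requires k <= number of training samples.
    """
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=float)
    # The new sample becomes the last vertex of the graph.
    X = np.vstack([X_train, np.asarray(x_new, dtype=float)[None, :]])
    n = X.shape[0]

    # Pairwise Euclidean distances; a vertex is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)

    # Connect every vertex to its k nearest neighbours, then symmetrize.
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(dist[i])[:k]] = 1.0
    A = np.maximum(A, A.T)

    # Column-stochastic transition matrix.
    W = A / A.sum(axis=0, keepdims=True)

    # Iterate p = (1 - c) * W p + c * e, restarting at the new sample.
    e = np.zeros(n)
    e[-1] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = (1 - restart) * (W @ p) + restart * e

    # Drop the new vertex, renormalize, and aggregate by training labels.
    visit = p[:-1] / p[:-1].sum()
    return visit @ Y_train


# Toy data: two well-separated clusters, each carrying one label.
X_demo_train = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
Y_demo_train = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
label_scores = rwr_label_scores(X_demo_train, Y_demo_train, [0.5, 0.5])
```

A label set could then be predicted by thresholding `label_scores`. How the thesis converts walk results into final labels is not stated in the abstract.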
Keywords: Multi-label Classification, Random Walk, Feature Selection, Information Gain