Text Categorization Of High Dimensional Imbalanced Data Based On Depth Label Correlation Mining

Posted on: 2018-11-13; Degree: Master; Type: Thesis
Country: China; Candidate: X F Jie
GTID: 2348330569486408; Subject: Computer Science and Technology
Abstract/Summary:
Traditional text classification assumes that each document is associated with only one category. However, in real-world text categorization tasks each document usually has multiple semantic meanings, so multiple labels are required to describe it accurately. Multi-label text categorization is an important method for solving this multi-semantic classification problem, as it can precisely and effectively represent the complicated semantic meanings of documents, and multi-label learning has become a prominent topic in text categorization. However, as a more general method, multi-label text categorization usually requires more complex classification models and is more challenging to solve. Its difficulties can be summarized in three aspects: 1) how to improve the efficiency and accuracy of multi-label learning algorithms on high-dimensional datasets; 2) how to effectively explore and utilize label correlations; 3) how to deal with the imbalance problem in multi-label text categorization. Aiming to provide effective and efficient solutions to multi-label text categorization tasks, the research work in this thesis can be summarized in the following two aspects.

1. Multi-label text data is usually high-dimensional, with a sparse feature space and low similarity among documents of the same class. To solve the multi-label text categorization problem effectively, the dimensionality of the text data needs to be reduced so that classification accuracy improves and classification complexity decreases. To this end, this thesis introduces a feature transformation method based on fuzzy similarity. The fuzzy similarities between features and labels are computed and used to transform high-dimensional text documents into lower-dimensional relevance vectors.

2. For the imbalance problem in multi-label classification, a two-stage multi-label learning algorithm is proposed. This algorithm divides all labels into two groups, i.e., imbalanced labels and common labels, based on the imbalance ratios of the labels. In the first stage, a multi-label hypernetwork model is trained to produce basic predictions for all labels. The second stage aims to improve classification performance on the imbalanced labels using the extra information provided by the correlations between common labels and imbalanced labels.

Experiments are conducted on eight multi-label text datasets to verify the effectiveness of the proposed methods. First, to verify the effectiveness of the proposed dimensionality reduction method, the classification results of BR-SVM, CLR, and ECC on the original datasets are compared with their results on the datasets after dimensionality reduction. Second, the classification results of the proposed methods are compared with those of BR-SVM, MLKNN, CLR, ECC, RAkEL, and COCOA to verify their effectiveness in dealing with the class-imbalance problem. The experimental results demonstrate that the proposed method achieves classification performance comparable to many state-of-the-art multi-label learning methods on high-dimensional, class-imbalanced text categorization problems.
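
The label-grouping step of the two-stage algorithm can be illustrated with a minimal sketch. The abstract does not give the exact definition of the imbalance ratio or the cut-off used, so both are assumptions here: the ratio is taken as the majority-class count divided by the minority-class count for each label, and `threshold` is a hypothetical parameter. The function names are likewise illustrative and do not come from the thesis.

```python
from typing import List, Tuple


def imbalance_ratios(Y: List[List[int]]) -> List[float]:
    """Per-label imbalance ratio for a binary label matrix Y (rows = documents).

    Assumed definition: majority count / minority count for each label.
    """
    n_docs = len(Y)
    n_labels = len(Y[0])
    ratios = []
    for j in range(n_labels):
        pos = sum(row[j] for row in Y)   # documents carrying label j
        neg = n_docs - pos               # documents without label j
        minority, majority = min(pos, neg), max(pos, neg)
        # A label that never (or always) occurs is treated as infinitely imbalanced.
        ratios.append(float("inf") if minority == 0 else majority / minority)
    return ratios


def split_labels(Y: List[List[int]], threshold: float = 3.0) -> Tuple[List[int], List[int]]:
    """Split label indices into (imbalanced, common) using a hypothetical threshold."""
    ratios = imbalance_ratios(Y)
    imbalanced = [j for j, r in enumerate(ratios) if r > threshold]
    common = [j for j, r in enumerate(ratios) if r <= threshold]
    return imbalanced, common


# Toy label matrix: 6 documents, 3 labels.
Y = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
imbalanced, common = split_labels(Y)
print(imbalance_ratios(Y))  # one ratio per label
print(imbalanced, common)   # label indices in each group
```

In this toy matrix the second label occurs in only one of six documents, so it falls in the imbalanced group, which the second stage would then target using its correlations with the common labels.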
Keywords/Search Tags: multi-label classification, evolutionary hypernetwork, multi-label hyper-network, label correlations, imbalanced data