
Research On Text Classification Technology For Asymmetric And Multi-label Problem

Posted on: 2020-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: X Jin
Full Text: PDF
GTID: 2428330590495999
Subject: Software engineering
Abstract/Summary:
Rapidly evolving information technology lets people create and share information quickly, and electronic documents have become the main way people obtain it. As information grows more diverse and disorganized, finding what one needs quickly and accurately becomes harder. Text classification is an important technology for organizing and classifying documents. However, with the rise of self-media, concept drift in text has become increasingly frequent. A traditional single label cannot accurately describe the dynamic changes of real objects, whereas multi-label classification can describe their multiple meanings accurately and objectively, so it is urgently needed. Besides the multi-label problem, there is also the asymmetric problem (i.e., the data imbalance problem).

Solutions to the data imbalance problem generally fall into three types, working at the algorithm level, the feature selection level, and the data level. Algorithm-level methods improve existing classification algorithms. Data-level methods mainly use resampling to improve the class distribution of the data. Feature-selection-level methods either modify an existing feature selection algorithm or propose a new one that suits asymmetric datasets. The PKM-undersampling algorithm proposed in this thesis works at the data level: it adopts downsampling and reduces the number of samples by clustering the majority-class sample set. The algorithm is an optimization of k-means, and in experiments it shows some improvement over the original k-means algorithm.

For the multi-label problem, traditional algorithms mainly follow two strategies: problem transformation and algorithm adaptation. Problem transformation converts the multi-label problem into single-label problems by some means and then classifies the text with a single-label classification algorithm. Algorithm adaptation either modifies a single-label algorithm so that it handles multi-label classification or proposes a new multi-label classification algorithm. This thesis proposes a new multi-label classification algorithm, PLPLC (Promoved-LPLC), which improves on the LPLC (Local Pairwise Label Correlation) algorithm. It considers not only the positive correlation between labels used by the original LPLC algorithm but also the negative correlation between labels. In experiments, it improves greatly on LPLC across many multi-label evaluation metrics and has certain advantages over other multi-label algorithms.

Studying the asymmetric and multi-label problems can improve text classification accuracy to a certain extent, and thus the validity and accuracy of information retrieval. Although the PKM-undersampling algorithm alleviates the asymmetry problem to a certain extent, it does not account for categories that contain very few samples; applying it in such cases sharply reduces the dataset size and thus degrades classifier performance. The PLPLC algorithm considers label correlation, but only between pairs of labels. Both methods can be further optimized in future research to improve their robustness.
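
The abstract describes reducing the majority class by clustering it, but gives no details of PKM-undersampling itself. As a rough illustration of the general idea only, the following sketch runs plain k-means on the majority class and keeps one representative per cluster. The function names, the choice of one cluster per minority sample, and the toy data are all assumptions made for this example, not the thesis's method.

```python
import math
import random


def assign(points, centroids):
    """Group each point with its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def undersample_majority(majority, minority, seed=0):
    """Keep at most len(minority) majority samples: the one closest to each centroid.

    Assumption for this sketch: one cluster per minority sample.
    """
    k = len(minority)
    if len(majority) <= k:
        return list(majority)
    centroids = kmeans(majority, k, seed=seed)
    clusters = assign(majority, centroids)
    return [
        min(members, key=lambda m: math.dist(m, centroids[c]))
        for c, members in enumerate(clusters)
        if members
    ]


# Hypothetical toy data: 40 majority points and 5 minority points in 2-D.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(5)]
kept = undersample_majority(majority, minority)
```

Keeping real samples rather than the centroids themselves means the reduced set contains only observed documents. The sketch also shows the limitation the abstract notes: when a class has very few samples, matching the majority to it shrinks the dataset sharply.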
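
The abstract says only that problem transformation turns the multi-label problem into single-label problems "by some means". Two standard transformations from the multi-label literature, binary relevance and label powerset, may help make this concrete. The sketch below assumes these two as examples; the abstract does not say which transformations the thesis uses, and the toy labels are hypothetical.

```python
def binary_relevance(label_sets, all_labels):
    """One binary target vector per label: 1 if the document carries that label."""
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in all_labels
    }


def label_powerset(label_sets):
    """Map each distinct label combination to a single class id."""
    class_ids = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids


# Hypothetical toy corpus: the label set of each of four documents.
label_sets = [{"sports", "politics"}, {"sports"}, {"politics"}, {"sports", "politics"}]
all_labels = ["politics", "sports"]

br_targets = binary_relevance(label_sets, all_labels)   # one binary problem per label
lp_targets, lp_classes = label_powerset(label_sets)     # one multi-class problem
```

Binary relevance trains one independent classifier per label and so ignores any relationship between labels, while label powerset captures co-occurrence but can produce many rare classes. That trade-off is one motivation for algorithms that model label correlation directly.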
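
To illustrate what positive and negative correlation between a pair of labels means, the sketch below computes the phi coefficient for every label pair in a small label matrix. This is not the PLPLC or LPLC procedure, for which the abstract gives no formulas. It is a global, whole-dataset measure chosen for this example, whereas the name LPLC suggests that correlations are estimated locally. The label names and data are hypothetical.

```python
import math
from itertools import combinations


def phi(a, b):
    """Phi coefficient between two equal-length 0/1 label columns."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A constant column has no defined correlation; report 0.0 in that case.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


# Hypothetical label columns for six documents (1 = document has the label).
columns = {
    "sports":   [1, 1, 1, 0, 0, 0],
    "football": [1, 1, 0, 0, 0, 0],
    "finance":  [0, 0, 0, 1, 1, 1],
}

pair_correlation = {
    (a, b): phi(columns[a], columns[b])
    for a, b in combinations(sorted(columns), 2)
}
```

Here "football" and "sports" tend to appear together, giving a positive coefficient, while "finance" and "sports" never co-occur, giving a negative one. A classifier that uses only positive pairs would ignore the second kind of evidence, namely that the presence of one label makes another less likely.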
Keywords/Search Tags: data imbalance problem, multi-label problem, downsampling, PKM-undersampling algorithm, PLPLC algorithm