
Research On Multi-label Learning Algorithm For Entity Information Mining

Posted on: 2018-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: S H Li
Full Text: PDF
GTID: 2348330563951355
Subject: Information and Communication Engineering
Abstract/Summary:
Entities are the main vehicles of data and information in the real world. Mining valuable information about entities, including their semantic content, attribute features and intrinsic relationships, is an essential way to cope with incompleteness in big data analysis, to optimize the performance of data processing and to provide useful references for downstream tasks. Learning labels that characterize the content and attributes of entities is therefore one of the main approaches to entity information mining. Although many excellent ideas and methods have been put forward from different fields and perspectives, promoting the development of multi-label learning algorithms for entity information mining, three problems remain to be addressed. First, with respect to instance distribution, the imbalance of multi-label data degrades the performance of entity information mining, while existing balancing strategies introduce redundant minority-class information and lose majority-class information. Second, with respect to feature distribution, the high feature dimension of entity data leads to overfitting and high computational complexity, yet existing dimension reduction methods rarely take label dependencies into account. Third, with respect to label distribution, algorithm efficiency needs further improvement when facing large-scale label sets. In view of these problems, and guided by the practical requirements of entity information mining, this thesis studies multi-label learning from the perspectives of instance distribution, feature distribution and label distribution. The main contributions are as follows.

1. To handle the problems of instance distribution, this thesis puts forward a multi-label random balanced resampling algorithm. The algorithm compares the number of instances of each label with a newly defined mean instance count in order to preserve the original distribution of the dataset, modifies the replication and deletion strategies so that the resampling processes of different labels remain independent, and applies random balanced resampling to both minority- and majority-class data to trade off redundancy against information loss. Experimental results show that the proposed method is especially suitable for datasets with high imbalance ratios and outperforms the compared algorithms.

2. To address the problems of feature distribution, this thesis proposes a multi-label feature selection algorithm based on improved entity label relevance. The algorithm first applies symmetrical uncertainty to normalize the information entropy and takes normalized mutual information as the relevance measure used to define label importance, with which the label-related terms in the dependency and redundancy criteria are weighted. A score function is then defined to evaluate feature importance, and the feature subset with the highest score is selected. Experiments demonstrate that, after this concise and accurate feature subset is selected, multi-label classification improves in both performance and efficiency, particularly on data with discrete features.

3. To solve the problems of label distribution, this thesis introduces a new multi-label classifier based on label matrix factorization. Taking the two-valued nature of the label matrix elements into account, the algorithm first represents the label matrix as the product of a latent matrix and a k-relation matrix, so that the data are mapped into a lower-dimensional latent space and label relevance is described explicitly. Conventional multi-label classification is then performed in the latent space, and the predicted latent labels are transformed back into actual labels by multiplication with the k-relation matrix. Experimental results show that the proposed algorithm is more stable and efficient, especially on datasets with a large number of labels and high label cardinality.

Brief illustrative sketches of the three methods are given below.
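The first sketch illustrates the resampling idea from contribution 1 in a minimal form: per-label positive counts are compared with their mean, minority labels are replicated and majority labels are pruned at random. The function name, the shared mean target, and the simple global index list are assumptions for illustration; the thesis' mechanism for keeping per-label resampling independent is not reproduced here.

```python
# Minimal sketch (hypothetical, not the thesis code) of random balanced resampling.
import numpy as np

def random_balanced_resample(X, Y, rng=None):
    """X: (n, d) feature matrix, Y: (n, q) binary label matrix.
    For each label, positives are replicated or removed at random until the
    positive count approaches the mean positive count over all labels."""
    rng = np.random.default_rng(rng)
    counts = Y.sum(axis=0)              # positive instances per label
    mean_count = int(counts.mean())     # shared target count (assumed criterion)
    keep = list(range(X.shape[0]))      # indices of instances kept/duplicated
    for j in range(Y.shape[1]):
        pos = np.flatnonzero(Y[:, j])
        if 0 < counts[j] < mean_count:  # minority label: replicate random positives
            extra = rng.choice(pos, size=mean_count - len(pos), replace=True)
            keep.extend(extra.tolist())
        elif counts[j] > mean_count:    # majority label: drop random positives
            drop = set(rng.choice(pos, size=len(pos) - mean_count, replace=False).tolist())
            keep = [i for i in keep if i not in drop]
    keep = np.array(keep)
    return X[keep], Y[keep]
```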
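The second sketch corresponds to contribution 2: symmetrical uncertainty normalizes the entropy terms, label importance weights the dependency term, and a greedy score function balances dependency against redundancy. The exact score used in the thesis is not reproduced; the helper names, the averaging used for label importance, and the assumption of integer-coded discrete features are all illustrative choices.

```python
# Minimal sketch (hypothetical) of label-importance-weighted multi-label feature selection.
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.stats import entropy

def su(a, b):
    """Symmetrical uncertainty: 2*I(a;b) / (H(a) + H(b))."""
    h_a = entropy(np.bincount(a) / len(a))
    h_b = entropy(np.bincount(b) / len(b))
    denom = h_a + h_b
    return 2.0 * mutual_info_score(a, b) / denom if denom > 0 else 0.0

def select_features(X, Y, k):
    """Greedily pick k features: reward label dependency weighted by label
    importance, penalize redundancy with already selected features."""
    n, d = X.shape
    q = Y.shape[1]
    # Label importance: average symmetrical uncertainty of a label with the others.
    W = np.array([np.mean([su(Y[:, j], Y[:, l]) for l in range(q) if l != j])
                  for j in range(q)])
    W = W / W.sum() if W.sum() > 0 else np.full(q, 1.0 / q)
    relevance = np.array([sum(W[j] * su(X[:, f], Y[:, j]) for j in range(q))
                          for f in range(d)])
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for f in range(d):
            if f in selected:
                continue
            redundancy = np.mean([su(X[:, f], X[:, s]) for s in selected]) if selected else 0.0
            score = relevance[f] - redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected
```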
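The third sketch mirrors contribution 3: the label matrix is factorized into a latent matrix and a k-relation matrix, a conventional learner is trained in the latent space, and predictions are mapped back by multiplying with the k-relation matrix. Non-negative matrix factorization and ridge regression stand in for the factorization and latent-space learner actually used in the thesis; the class name and threshold are assumptions.

```python
# Minimal sketch (hypothetical) of multi-label classification via label matrix factorization.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import Ridge

class LatentLabelClassifier:
    def __init__(self, k=10, threshold=0.5):
        self.k = k                    # dimension of the latent label space
        self.threshold = threshold    # cut-off for turning scores into 0/1 labels

    def fit(self, X, Y):
        # Factorize the n x q binary label matrix: Y ~= C (n x k) @ R (k x q).
        nmf = NMF(n_components=self.k, init="nndsvda", max_iter=500)
        C = nmf.fit_transform(Y)      # latent label matrix
        self.R_ = nmf.components_     # k-relation matrix encoding label relevance
        # Conventional learning in the latent space: regress latent labels on features.
        self.reg_ = Ridge(alpha=1.0).fit(X, C)
        return self

    def predict(self, X):
        C_hat = self.reg_.predict(X)           # predicted latent labels
        scores = C_hat @ self.R_               # map back to the original label space
        return (scores >= self.threshold).astype(int)
```

Because the learner only has to fit k latent targets rather than the full label set, this style of method tends to scale better as the number of labels and the label cardinality grow, which matches the efficiency claim in the abstract.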
Keywords/Search Tags:Entity Information Mining, Multi-label Learning, Data Imbalance, Feature Selection, Label Dependency, Label Matrix Factorization