
Research On Some Key Problems Of Polysemy And Heterogeneous Data Classification

Posted on: 2020-02-09; Degree: Doctor; Type: Dissertation
Country: China; Candidate: C Han
GTID: 1368330620458604; Subject: Software engineering
Abstract/Summary:
Classification is one of the central research areas in machine learning. It is applied in many fields, including spam detection, bank loan credit evaluation, automatic classification of news texts, protein function prediction, image and video tag prediction, social network user grouping, and e-commerce item classification. Researchers across these fields have therefore long paid close attention to it. With the rapid development of Internet technology, large amounts of complex data are being generated and the ways in which data is organized have diversified, so classification faces new problems. Traditional supervised learning assumes that each sample is a single instance; as data grows more complex, this assumption no longer holds for all supervised learning problems. Most supervised learning methods also assume that examples are independent of each other, yet in many practical scenarios the data are linked by complex relationships. Finally, traditional classification studies focus on single-label learning, whereas in many practical applications one example is associated with multiple semantic meanings. Based on Markov chains and random forests, this dissertation proposes several classification algorithms: a Hausdorff distance-based multi-instance multi-label classification algorithm, a Markov chain-based heterogeneous network entity classification algorithm, and an extremely randomized forest with a hierarchy of multi-label classifiers. The main research contents are as follows.

1. For the multi-instance classification problem, we propose H-Mark, a multi-instance multi-label classification algorithm based on the Hausdorff distance and a Markov chain. Following the idea of label propagation, the algorithm uses the Hausdorff distance to measure the relationship among examples from their feature information and constructs a similarity matrix. Based on the Markov chain model, it builds a transition probability matrix from the similarity matrix and obtains the steady-state probability distribution of the examples. Because every entry of the similarity matrix is non-zero, a neighbor parameter controls the number of neighbors of each example, which yields a sparse transition probability matrix. A suitable threshold on the steady-state probability distribution is then used to predict the labels of each example. In biomolecular function annotation, each protein is composed of multiple domains and is annotated with multiple functions, so the multi-instance multi-label classification model is used to predict the functions of multi-domain proteins. Experiments on a real protein dataset show that H-Mark outperforms the multi-instance multi-label classification algorithms it is compared with.

2. For entity classification in heterogeneous networks, we propose a Markov chain-based heterogeneous network entity classification model. The model has three parts. First, the heterogeneous network is transformed into a multi-relational network: only the entities to be classified are retained, and the associations among the various entity types in the heterogeneous network are converted into relations among those entities. Second, a three-way tensor is used to represent the multi-relational network, and a feature-based transition probability matrix and the Markov chain-based heterogeneous network entity classification model are proposed. Finally, we propose the HIN-Mark algorithm, which solves the classification model iteratively. HIN-Mark simultaneously computes the probability distribution of the entities and the probability distribution of the relations. From these two distributions, we predict the labels of the entities and obtain the correlation of each type of relation with the labels. We analyze HIN-Mark theoretically and establish the existence and uniqueness of the entity probability distribution x and the relation probability distribution z, as well as the convergence of the algorithm. Results on multiple real datasets demonstrate the important role that the correlation between network relations and labels plays in entity classification, and show that HIN-Mark outperforms the comparison algorithms.

3. For multi-label classification, we propose ERF-H, a random forest multi-label classification algorithm based on a hierarchical clustering tree. The tree model of the random forest is a hierarchical clustering tree. First, the tree is constructed by hierarchically clustering the labels. The original data is then assigned to the nodes at each level of the tree according to the label clusters in each node. When the random forest is built, the training data of each clustering tree is sampled randomly. We obtain the label probability distribution from each clustering tree and use it to predict the labels of unlabeled examples. Experimental results on multiple multi-label datasets show that ERF-H outperforms the other algorithms compared.
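
The H-Mark pipeline described in point 1 (Hausdorff distance, neighbor-limited transition matrix, steady-state distribution, threshold) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The abstract does not give the exact formulas, so the similarity kernel `exp(-distance)`, the restart weight `alpha`, the seed vector, and all function names below are assumptions chosen for illustration only.

```python
import numpy as np

def hausdorff(bag_a, bag_b):
    """Symmetric (maximum) Hausdorff distance between two bags of instances."""
    # Pairwise Euclidean distances, shape (len(bag_a), len(bag_b)).
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def transition_matrix(bags, k):
    """Column-stochastic transition matrix keeping k neighbours per bag."""
    n = len(bags)
    dist = np.array([[hausdorff(a, b) for b in bags] for a in bags])
    sim = np.exp(-dist)          # assumed similarity kernel (not from the source)
    np.fill_diagonal(sim, 0.0)   # no self-transitions
    sparse = np.zeros_like(sim)
    for j in range(n):
        nearest = np.argsort(sim[:, j])[-k:]   # indices of the k most similar bags
        sparse[nearest, j] = sim[nearest, j]
    return sparse / sparse.sum(axis=0, keepdims=True)

def steady_state(P, seed, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate x = alpha * P @ x + (1 - alpha) * seed until it stops changing.

    The restart term (1 - alpha) * seed is an assumption: it keeps the
    chain anchored to the bags already known to carry the label.
    """
    x = seed.copy()
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * seed
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return x_new

# Toy data: four bags of 2-D instances forming two well-separated groups.
bags = [
    np.array([[0.0, 0.0], [0.0, 1.0]]),
    np.array([[0.1, 0.0], [0.0, 1.1]]),
    np.array([[5.0, 5.0], [5.0, 6.0]]),
    np.array([[5.1, 5.0], [5.0, 6.1]]),
]
P = transition_matrix(bags, k=2)
seed = np.array([1.0, 0.0, 0.0, 0.0])   # only bag 0 is known to have the label
scores = steady_state(P, seed)
predicted = scores >= 0.2                # illustrative threshold, not from the source
```

In this toy run, the unlabeled bag that lies close to the labeled one receives a much higher steady-state score than the two distant bags, which is the behavior the threshold step relies on. For multiple labels, the same iteration would be repeated with one seed vector per label.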
Keywords/Search Tags: multi-instance classification, heterogeneous information network classification, multi-label classification, tensor, Markov chain