The training of artificial intelligence models relies on large-scale, high-quality labeled datasets. Such datasets can be obtained through human annotation, but manual labeling is inefficient and expensive. Distant supervision can automatically construct large-scale annotated datasets; however, the quality of the labels it produces is low. Since the quality of the training set determines the upper limit of model performance, how to optimize distantly supervised datasets has become an active research topic. Distant supervision faces different challenges in different scenarios, so solutions should be designed for the specific problem and usage scenario. The main work and results of this paper are as follows:

(1) This paper studies two scenarios, entity typing in knowledge bases and relation classification, discusses the problems of distantly supervised datasets in each scenario, and designs optimization methods according to the characteristics of each problem.

(2) We show that distantly supervised entity typing in knowledge bases suffers from label noise and semantic heterogeneity. To address label noise, we propose a novel selection criterion that identifies the noisiest instances in distantly supervised datasets, together with a hybrid annotation strategy to relabel them. To address semantic heterogeneity, we propose another selection criterion that picks the most mismatched instances from unlabeled data for annotation, augmenting the training set. We apply our method to a real Chinese knowledge graph, and experimental results demonstrate its effectiveness.

(3) We show that distantly supervised relation classification suffers from label noise. We propose to use a natural language inference (NLI) model built on a pretrained language model to identify noisy instances in the dataset. For each instance in a relation classification dataset, the text is used as the premise, and the entity pair and relation are converted into a hypothesis through templates. The probability that the hypothesis can be inferred from the premise serves as an estimate of whether the distantly supervised label is reasonable, and the dataset is then filtered according to these scores. Because the NLI model lacks a large-scale, high-quality training set, we design a reinforcement learning framework to train it. Finally, we use the trained NLI model to filter a distantly supervised relation classification dataset and train the relation classification model on the filtered data. We apply the proposed method to a public distantly supervised relation classification dataset, and the experimental results demonstrate its effectiveness.
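The NLI-based filtering step described in (3) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the relation templates, the 0.6 threshold, and the toy token-overlap scorer are all assumptions made for the sake of a runnable example; in the actual method, the scorer would be a pretrained NLI model returning the entailment probability, and it would be further trained with the reinforcement learning framework.

```python
# Hypothetical relation -> hypothesis template mapping (illustrative only).
TEMPLATES = {
    "founder_of": "{head} founded {tail}",
    "born_in": "{head} was born in {tail}",
}


def build_hypothesis(head: str, tail: str, relation: str) -> str:
    """Convert an (entity pair, relation) triple into an NLI hypothesis."""
    return TEMPLATES[relation].format(head=head, tail=tail)


def _tokens(text: str) -> set:
    """Lowercase, punctuation-stripped token set."""
    return set(text.lower().replace(".", "").replace(",", "").split())


def entailment_prob(premise: str, hypothesis: str) -> float:
    """Placeholder scorer. A real system would query a pretrained NLI
    model for P(entailment | premise, hypothesis); here we use trivial
    token overlap just so the sketch runs end to end."""
    p, h = _tokens(premise), _tokens(hypothesis)
    return len(p & h) / max(len(h), 1)


def filter_dataset(instances: list, threshold: float = 0.6) -> list:
    """Keep instances whose distant-supervision label is consistent with
    the sentence, judged by the entailment score of the templated
    hypothesis against the sentence used as premise."""
    kept = []
    for inst in instances:
        hyp = build_hypothesis(inst["head"], inst["tail"], inst["relation"])
        if entailment_prob(inst["text"], hyp) >= threshold:
            kept.append(inst)
    return kept
```

For example, the instance ("Steve Jobs founded Apple in 1976.", founder_of) yields the hypothesis "Steve Jobs founded Apple" and scores high, while a mislabeled born_in instance over the same sentence scores low and is dropped. The filtered list then serves as the training set for the relation classifier.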