
Research On Key Technologies And Application Of Entity Knowledge Extraction

Posted on: 2023-02-25    Degree: Doctor    Type: Dissertation
Country: China    Candidate: K Y Pang    Full Text: PDF
GTID: 1528307169977029    Subject: Software engineering
Abstract/Summary:
In the information era, the scale of text resources and structured knowledge on the Internet is expanding rapidly, and knowledge-based Internet products keep emerging. These applications depend on extracting entity knowledge. Although knowledge engineering and information extraction have developed over the past decades, they still face difficulties in practical scenarios. For example, entity representation requires sufficient context resources or links in a knowledge graph, and fails when context information is insufficient or even missing. Entity classification tasks that rely on distant supervision to provide sufficient training samples are affected by noise in those samples and cannot achieve the best performance. In complex tasks such as entity knowledge annotation, existing research has oversimplified the problem and fails to generate truly high-quality annotations that cover all entities and are well selected. Existing corpora directly reflect the final result of complex reasoning by human editors and are hard to fit directly with end-to-end models. To address these issues, this paper studies key techniques in the fundamental tasks of entity knowledge extraction: entity representation independent of context information, entity classification under distant-supervision noise, and entity annotation based on representation enhancement for modeling complex requirements. This paper also builds the data resources needed for this research to meet the knowledge processing needs of online education, and applies the above techniques to entity knowledge extraction from online course resources. The main contents and contributions of this paper are as follows:

Firstly, for entity representation independent of context information, this paper presents the Compositional Continuous Bag of Words (C-CBOW) model and the Neighbor Cluster Average (NCA) method for computing compositional semantic embeddings, which estimate the distributed representation of a phrase from the combination of its word vectors and thus solve the representation problem of phrase entities. First, by randomly replacing words during training with the characters that compose them, character representations are obtained as weighted versions of word representations, bringing characters, words, and phrases with compositional relationships closer together in the C-CBOW vector space. Then, a word's senses are represented by the embeddings of its retrieved neighbors. Finally, the embedding of a phrase's compositional semantics is obtained by modeling the combination of the senses of its constituent words through neighbor clustering. Experiments show that the C-CBOW model and the NCA method obtain more accurate semantic embeddings than existing methods.

Secondly, to deal with distant-supervision noise in entity typing, a feature clustering forward loss correction (FCLC) method is proposed. FCLC uses the feature distribution of the samples to divide them into clusters, estimates the distant-supervision noise within each cluster from its high-quality samples, and then trains an unbiased model on the noisy samples with cluster-wise forward loss correction. Experiments and analyses on three authoritative evaluation datasets show that FCLC significantly improves on previous fine-grained typing systems. This paper also designs experiments showing that FCLC requires only a few high-quality samples and can still work through pseudo-label generation even when no high-quality samples are available.
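To make the cluster-wise forward correction concrete, the following is a minimal sketch: samples are grouped by feature distribution, a noise transition matrix is estimated per cluster from a handful of trusted samples, and the model's predicted class probabilities are pushed through that matrix before the loss. The k-means clustering choice and all function names are illustrative assumptions, not the dissertation's actual implementation.

```python
# Sketch of cluster-wise forward loss correction in the spirit of FCLC.
# Assumed names; not the dissertation's code.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def cluster_features(features, n_clusters=10):
    """Group samples by their feature distribution (here: plain k-means)."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)


def estimate_transition_matrix(clean_labels, noisy_labels, n_classes):
    """Estimate T[i, j] = P(noisy label j | true label i) from one cluster's
    trusted (high-quality) samples."""
    T = torch.full((n_classes, n_classes), 1e-6)
    for y_true, y_noisy in zip(clean_labels, noisy_labels):
        T[y_true, y_noisy] += 1.0
    return T / T.sum(dim=1, keepdim=True)


def forward_corrected_loss(logits, noisy_labels, cluster_ids, T_per_cluster):
    """Cluster-wise forward correction: map P(true class | x) to
    P(noisy class | x) with the cluster's transition matrix, then take NLL."""
    probs = F.softmax(logits, dim=-1)
    losses = []
    for p, y, c in zip(probs, noisy_labels, cluster_ids):
        noisy_probs = T_per_cluster[c].T @ p   # P(noisy class | x)
        losses.append(-torch.log(noisy_probs[y] + 1e-12))
    return torch.stack(losses).mean()
```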
Thirdly, for entity knowledge annotation, this paper presents a Wikipedia-style entity discovery and selection model (Discovery and Selection like Wikipedia, DsWiki). In contrast to existing research that limits the scope of entities and ignores selection, this paper proposes the task of mention detection and selection for entity annotation and builds the WikiC dataset, which reflects the annotations made by human experts in Wikipedia. For open entity boundaries, DsWiki uses an end-to-end named entity recognition framework to enumerate and score all substrings in the text (see the sketch at the end of this abstract). To supplement information for mention discovery, DsWiki uses the results of entity discovery and linking based on a mention-entity table as a reference list and enhances token representations through an attention mechanism. To model Wikipedia's requirement of avoiding multiple links to the same entity within an article, DsWiki further enhances token representations with repetition encoding. Experiments against existing entity labeling models on the WikiC dataset verify the effectiveness of DsWiki in fitting the entity annotations of Wikipedia experts.

Fourthly, in education-oriented application research, this paper applies entity knowledge extraction techniques to entity knowledge labeling in online course resources. Whereas existing research on annotating online course resources ignores context and performs only simple classification, this paper constructs a fine-grained course subtitle entity annotation dataset, CFG-MOOC, which supports new application scenarios along two dimensions: incorporation of context information and fine-grained subdivision of course concepts. This paper builds a baseline system for each new application scenario using the key entity knowledge extraction techniques above, providing a basis for further research on knowledge mining of online learning resources.
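The span enumeration step referenced in the DsWiki paragraph can be illustrated with a short sketch: every substring up to a maximum length is turned into a candidate span and scored from contextual token representations. The endpoint-concatenation scorer and all names here are assumptions for illustration only; the actual DsWiki model additionally enhances the token representations with reference-list attention and repetition encoding before scoring.

```python
# Sketch of exhaustive span enumeration and scoring for mention detection.
# Assumed module and scorer design; not the DsWiki implementation itself.
import torch
import torch.nn as nn


class SpanScorer(nn.Module):
    def __init__(self, hidden_size, max_span_len=10):
        super().__init__()
        self.max_span_len = max_span_len
        # Score a span from the concatenation of its start and end token states.
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, token_states):
        """token_states: (seq_len, hidden_size) contextual representations,
        e.g. from a BERT-style encoder. Returns (start, end, score) for every
        candidate span up to max_span_len tokens long."""
        seq_len = token_states.size(0)
        spans = []
        for start in range(seq_len):
            for end in range(start, min(start + self.max_span_len, seq_len)):
                rep = torch.cat([token_states[start], token_states[end]], dim=-1)
                spans.append((start, end, self.scorer(rep).squeeze(-1)))
        return spans
```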
Keywords/Search Tags: Entity Knowledge Extraction, Entity Representation, Fine-grained Entity Typing, Entity Knowledge Annotation, Representation Enhancement, Information Extraction, Natural Language Processing