| In text knowledge extraction methods, formatting information is lost when text is treated as a plain sequence of raw data. To make it easier to trace knowledge in the extraction results back to its source and to highlight the main content of each part of a text, this thesis studies knowledge extraction methods that use chapter formatting information to improve the automated analysis, understanding, and representation of text knowledge. Current research on knowledge extraction from formatted text can be divided into approaches based on a priori knowledge and approaches based on machine learning. A priori knowledge-based approaches query relational databases to retrieve text content and build the extraction results from the data the databases return. Machine learning-based approaches extract the key content of a text through techniques such as sentence classification and generate the extraction results from it. To avoid the time and labor needed to build a relational database and to reduce the cost of knowledge extraction, a knowledge representation framework for chapter-level formatted text is designed on the basis of machine learning-based knowledge extraction, with knowledge granularity divided into four levels: word level, sentence level, paragraph level, and chapter level. Following the framework, word-level and sentence-level knowledge are extracted as the basic knowledge units of the text. This word- and sentence-level knowledge is then used to analyze the connections among the paragraphs of the text and to extract the paragraph connection map and the text summary, which make up the paragraph-level knowledge. Finally, knowledge from all levels is consolidated into the text knowledge graph, which serves as the chapter-level knowledge. The main work and innovations are as follows: (1) Defining a knowledge representation framework for chapter-level formatted text. To clarify the key knowledge elements to be extracted in the
text, based on reading and analysis of chapter-level formatted text, four kinds of knowledge elements are defined at the word, sentence, paragraph, and chapter levels of granularity, and the knowledge representation framework for chapter-level formatted text is designed. (2) A Chinese nested named entity extraction algorithm based on word-sense features is proposed. To extract the entities in the text more comprehensively, a nested named entity recognition algorithm is designed that analyzes the structure of nested named entities together with the part of speech and meaning of their words, so that entities with complex nested structure are identified accurately. (3) A paragraph topic extraction algorithm for chapter-level formatted text is proposed. To extract the main content of each part of the text, paragraph topics are extracted from the paragraphs delimited by the text format. Building on the principles of the TextRank algorithm used in summary extraction, the original algorithm is changed from computing at sentence-level granularity to computing at word-level granularity, and the topic sentence of each paragraph is extracted by tallying the TextRank values of its words. (4) A text knowledge graph supplementation method based on a graph neural network is proposed. To improve the quality and completeness of the knowledge extraction results, the results need to be supplemented. A text knowledge graph is built from the existing extraction results and provided as input to a graph neural network for link prediction, and the graph is then supplemented according to the prediction results. |
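
The word-level TextRank idea in contribution (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `word_textrank` and `topic_sentence`, the co-occurrence window of 2, the damping factor of 0.85, the English regex tokenizer (the thesis targets Chinese text, which would need a word segmenter), and the choice to score a sentence by the mean score of its distinct words are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def word_textrank(sentences, window=2, damping=0.85, iterations=50):
    """Score words by PageRank-style iteration over a co-occurrence graph.

    `sentences` is a list of token lists. Two words are linked when they
    appear within `window` positions of each other in the same sentence.
    """
    graph = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)

    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbour's score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[word])
            for word in graph
        }
    return scores


def topic_sentence(paragraph):
    """Return the sentence whose distinct words have the highest mean score."""
    raw = [s.strip() for s in re.split(r"[.!?]", paragraph) if s.strip()]
    if not raw:
        return ""
    # Hypothetical tokenizer: lowercase ASCII words only.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in raw]
    scores = word_textrank(tokenized)

    def sentence_score(tokens):
        unique = set(tokens)
        return sum(scores.get(t, 0.0) for t in unique) / max(len(unique), 1)

    best = max(range(len(raw)), key=lambda i: sentence_score(tokenized[i]))
    return raw[best]
```

Calling `topic_sentence` on each paragraph delimited by the document's formatting would yield one candidate topic sentence per paragraph. Averaging rather than summing the word scores is a design choice that keeps long sentences from winning on length alone.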