
Research On Method Of Subject Indexing In Electronic Records By Topic Model And Knowledge Graph

Posted on: 2021-11-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H W Wu
Full Text: PDF
GTID: 1488306521463174
Subject: Information Science
Abstract/Summary:
The world has gradually moved from the information age into the era of big data. A large part of the massively generated data is text written in natural language. Electronic official records are one such type of textual data: they are digital documents with legal effect and a standardized format, produced by government agencies, enterprises and institutions. The electronic records accumulated over the years are indexed, classified and organized according to their subject content, or topic databases are established, to facilitate retrieval, development and utilization. However, manually labeling electronic records is labor-intensive and slow, and it produces inconsistent indexing. It is therefore necessary to study methods for automatically indexing electronic records.

Aiming at the problem of subject analysis and indexing of unstructured electronic records, this dissertation proposes a method for automatic indexing of records that combines probabilistic topic models with knowledge graph technology. Compared with existing automatic subject indexing methods, this dissertation treats the electronic records as a whole when identifying topics, and builds an external knowledge base from a thesaurus so that subjects are indexed with standardized, formal keywords and categories. The content studied in this dissertation covers the following three aspects:

(1) Analysis and research on the topics of electronic records based on topic models. Topic analysis of electronic records is performed from a holistic perspective: a variety of natural language processing techniques are applied to convert the records into a corpus, and topic models are used for topic recognition and analysis. This includes fusing the semantic features of the subject vocabulary with the general topic model LDA to identify topics, and combining co-occurrence network features with the hierarchical topic model hLDA to analyze the topic hierarchy.

(2) Research on the automatic semantic conversion of a thesaurus based on knowledge graphs. A machine-usable subject knowledge base built on a knowledge graph can make up for the lack of background knowledge when topic models analyze unstructured text, and can enhance the semantics of topic indexing. Based on theoretical research on the semantic representation of a thesaurus as a knowledge graph, a method is constructed for automatically converting a traditional paper thesaurus into semantic form using knowledge graph technology, which lays a solid technical foundation for indexing the subjects of electronic records.

(3) Research on subject indexing of electronic records. A method of subject indexing of electronic records is proposed that comprehensively applies topic models and a thesaurus-based knowledge graph. After the topic model has been applied for topic analysis, knowledge graph technology and the external knowledge of the thesaurus are used to carry out the subject indexing, with formal subject terms and categories providing standardization, control and unified organization. Specifically, this includes electronic records subject indexing based on LDA and the knowledge graph, and electronic records subject indexing based on hLDA and the knowledge graph.

The main innovations and contributions of this dissertation are:

(1) For the practical problem of subject indexing of electronic records, a method of automatic indexing that comprehensively applies topic models and knowledge graph technology is proposed. The work extends the types of text to which topic models are applied for subject analysis to include electronic records, and explores how knowledge graph technology can be used to establish an official subject knowledge base and provide subject indexing services.

(2) A method for the semantic conversion of a thesaurus is proposed. A SKOS data model description scheme for the subject vocabulary and an automatic conversion algorithm were designed for the task of automatic subject indexing of electronic records, together with an algorithm that automatically converts the result into a knowledge graph. These algorithms are implemented in the Python programming language.

(3) The semantic conversion of the "Chinese Archives Subject Thesaurus" was completed, transforming the entire contents of its main table and category table from traditional paper form into a knowledge graph stored in a graph database. This contributes important basic data to the academic communities of library science, information science and archival science, as well as to archive management practitioners.

The dissertation includes 40 figures, 17 tables and 3 appendices.
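To make the pipeline in the abstract more concrete, the following is a minimal sketch of the two knowledge-graph steps: converting thesaurus entries into SKOS-style triples, and mapping topic words onto formal subject terms. It is an illustration under stated assumptions, not the dissertation's implementation. The sample entries, the relation codes (UF, BT, NT, RT), the `example.org` namespace, the function names and the topic weights are all hypothetical; only the SKOS property names (`prefLabel`, `altLabel`, `broader`, `narrower`, `related`) come from the SKOS standard. The topic words are assumed to come from a topic model such as LDA, which is not run here.

```python
from collections import Counter

SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical thesaurus entries; the real thesaurus format is not given in the abstract.
RECORDS = [
    {"term": "archives management", "UF": ["records administration"],
     "BT": ["management"], "RT": ["electronic records"]},
    {"term": "electronic records", "UF": ["digital records", "e-records"],
     "BT": ["records"], "RT": ["archives management"]},
]

# Assumed mapping from thesaurus relation codes to SKOS properties.
RELATION_TO_SKOS = {"BT": "broader", "NT": "narrower", "RT": "related"}


def concept_uri(term):
    """Build an illustrative concept URI in a placeholder namespace."""
    return "http://example.org/thesaurus/" + term.replace(" ", "_")


def to_triples(records):
    """Convert thesaurus entries into (subject, predicate, object) triples."""
    triples = []
    for rec in records:
        subject = concept_uri(rec["term"])
        triples.append((subject, SKOS + "prefLabel", rec["term"]))
        for alt in rec.get("UF", []):  # non-preferred entry terms
            triples.append((subject, SKOS + "altLabel", alt))
        for code, prop in RELATION_TO_SKOS.items():
            for target in rec.get(code, []):
                triples.append((subject, SKOS + prop, concept_uri(target)))
    return triples


def index_topic_words(topic_words, triples):
    """Map weighted topic words to preferred subject terms, summing weights per term."""
    label_to_uri = {}
    preferred = {}
    for s, p, o in triples:
        if p == SKOS + "prefLabel":
            preferred[s] = o
            label_to_uri[o.lower()] = s
        elif p == SKOS + "altLabel":
            label_to_uri[o.lower()] = s
    scores = Counter()
    for word, weight in topic_words:
        uri = label_to_uri.get(word.lower())
        if uri is not None:  # words absent from the thesaurus are skipped
            scores[preferred[uri]] += weight
    return scores.most_common()


if __name__ == "__main__":
    graph = to_triples(RECORDS)
    # Hypothetical output of a topic model for one record: (word, weight) pairs.
    topic = [("digital records", 0.4), ("e-records", 0.2),
             ("records administration", 0.3), ("weather", 0.1)]
    for term, score in index_topic_words(topic, graph):
        print(term, round(score, 2))
```

In this sketch, free-text topic words that match either a preferred or a non-preferred label are normalized to the preferred term, which is one simple way to obtain the controlled, standardized indexing the abstract describes. A graph database or an RDF library could replace the plain list of triples.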
Keywords/Search Tags:subject analysis, subject indexing, electronic records, topic model, thesaurus, knowledge graph, SKOS, automation