Research And Application Of LSTM Sequence Annotation Model For Medical Literature

Posted on: 2021-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: J T Hu
GTID: 2404330623979537
Subject: Computer Science and Technology
Abstract/Summary:
As the basic unit of a domain knowledge base, the medical knowledge entity is an important language unit that carries information in the medical literature. How to extract structured, machine-understandable knowledge from unstructured text has become the critical point for automatically building a knowledge base in the medical field. Previous research mainly focuses on the accuracy of a single extraction algorithm, paying less attention to the hierarchy of domain knowledge categories or the training efficiency of knowledge extraction models. Additionally, a single algorithm often fails to make the best of text representation and the structural features that may play important roles, which has become the main reason for the insufficient generalization of the overall extraction. Based on the domain characteristics of medical knowledge, this thesis analyzes the domain knowledge system from the perspective of medical activities. With the most suitable knowledge representation method, the domain knowledge model is constructed and the triple annotation of knowledge is formulated. With a neural network language model as the main framework, this thesis realizes the automatic extraction of medical knowledge entities, verifies its effectiveness through groups of controlled experiments, and demonstrates via a prototype system that the knowledge extraction algorithm works well in terms of generalization and robustness. The main work is as follows:

(1) To address the inability of the hierarchical Softmax algorithm to perform incremental training and its low efficiency when training on massive data, this thesis proposes a dynamic hierarchical Softmax algorithm. Data samples are loaded incrementally, and a node replacement method dynamically builds the coding tree during incremental training. At the same time, to avoid the oscillating decline of the loss function caused by the small sample size, the first-order and second-order moment estimates of the gradient are used to dynamically adjust the parameter update direction and learning rate, and the range of gradient variation is used to limit the range of weight changes and reduce the training error at convergence, improving word vector training efficiency. Using the Chinese Wikipedia corpus as experimental data, the training efficiency is measured and the quality of the resulting word vectors is analyzed. The results show that the dynamic hierarchical Softmax algorithm significantly improves training efficiency and shortens the training period compared with existing methods.

(2) Current LSTM-CRF models based on character or word sequences fail to explicitly utilize information between words and word sequences. Therefore, this thesis proposes a Lattice grid structure that represents all possible word combinations in a sentence and collects potential compound words into one grid unit, in order to avoid the noise caused by segmentation errors, automatically control the information flow in the sentence, and improve the targeting of model tags. Moreover, in view of the labeling inconsistency caused by training, this thesis employs an Attention mechanism that can obtain global information at the document level. Specifically, this thesis improves the attention matrix and defines a variety of alignment functions, instead of the single scoring formula in the original matrix, to measure the similarity between words in the document and predict the final label based on the resulting confidence score. Compared with current methods, the Att-Lattice LSTM-CRF model effectively alleviates labeling inconsistency and improves the model's robustness and its adaptability to various fields when recognizing compound knowledge.

(3) The BIOS marking method fails to map the internal connections of a knowledge entity to its label. Hence, this thesis proposes a new method of knowledge modeling through a general analysis of medical activities and a summary of the reusable entity categories and their internal connections. Based on the Unified Medical Language System (UMLS), the existing medical knowledge marking scheme is improved to provide professional guidance for subsequent tasks such as knowledge extraction, knowledge fusion, and knowledge display.

(4) On the basis of the CMEKG medical knowledge display platform and the Labelme text annotation tool, this thesis designs and implements the architecture and functional modules of the prototype system. By providing evaluation criteria for each model, the usability and efficiency of the prototype system are verified.
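
As an illustration of the moment-based update mentioned in item (1), the following is a minimal sketch of an Adam-style step, which is the standard way to combine first-order and second-order moment estimates of the gradient. The abstract does not give the thesis's exact update rule, so the function names (`adam_step`, `minimize_quadratic`), the hyperparameter values, and the toy objective are all assumptions chosen for illustration.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.025, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam-style update and return (new_params, new_m, new_v).

    params, grads, m, v are lists of floats of equal length; t is the
    1-based step counter. Hyperparameter defaults are assumptions, not
    values taken from the thesis.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # First-order moment: exponential moving average of the gradient.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        # Second-order moment: exponential moving average of the squared gradient.
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # m_hat sets the update direction; sqrt(v_hat) rescales the step size.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(steps, lr=0.05):
    """Toy usage (hypothetical): minimise f(w) = (w - 3)^2 starting from w = 0."""
    w, m, v = [0.0], [0.0], [0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * (w[0] - 3.0)]  # derivative of (w - 3)^2
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w[0]
```

Because each step is divided by the square root of the second-moment estimate, a large or noisy gradient shrinks the effective step size per parameter, which is the usual motivation for using such estimates to damp an oscillating loss.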
Keywords/Search Tags:knowledge extraction, hierarchical softmax algorithm, domain modeling, lattice structure, attention mechanism