
Research On Medical Knowledge Base Extraction Based On Internet Information

Posted on: 2018-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Tian
Full Text: PDF
GTID: 2348330536981906
Subject: Computer Science and Technology
Abstract/Summary:
Medicine is one of the most important sciences for human beings. To improve medical diagnosis and treatment, medical informatization has become a research hot spot. Building a medical information system requires the support of a medical knowledge base, and the key to building a knowledge base is knowledge acquisition. Medical knowledge is usually stored in natural-language text, which is easy for humans to understand but not for machines. Only after information extraction can medical knowledge be transformed into structured data that machines can use. The first step of information extraction is named entity recognition. However, the lack of open medical corpora makes this work rather difficult. Existing work commonly relies on a small amount of manually annotated text, so it cannot be widely applied. This paper argues that using automatic methods to construct larger corpora is a better choice. The Internet holds large quantities of data, including many medical websites. These websites contain medical texts, which can be regarded as unlabeled corpora. Meanwhile, most websites maintain an index of medical vocabulary for searching, which can serve as a dictionary resource. With these resources, the following studies on named entity recognition have been carried out:

(1) An iterative framework is proposed to exploit Internet resources. Given the limitations of automatic methods and the incompleteness of the dictionary resource, learning iteratively is expected to improve the model. In the framework, we first use the initial settings for labeling. After a round of training, newly found words are added to the dictionary. The new dictionary is then used for the next round of training. After several iterations, the final model is improved.

(2) An automatic annotation method based on a general model and a domain dictionary is proposed. Although Internet resources are numerous, they lack labels, so they cannot be used until labels are provided. If a general model alone is used to annotate them, performance drops because of domain differences. Therefore, we add the dictionary resource to the general model to improve annotation accuracy. In addition, the model is read-only, so it is suitable for an iterative framework.

(3) Research on an incremental named entity recognition model is carried out. Given the large size of the texts and the iterative framework we present, the cost of retraining would be too high if traditional methods were used to build the model. Therefore, this paper uses an online algorithm, the average perceptron, to realize incremental training. The model introduces various features, including lexical features, affix features, word representation features and so on. Multiple sets of experiments are conducted under different conditions to test the effect of these features.

(4) Research on model compression is carried out. To cope with the excessive number of features in the model, this paper presents a heuristic method to compress the model. Considering the characteristics of the perceptron model, the number of updates is used to mask features so as to reduce the size of the model. Experimental results show that our method can compress the model effectively with little drop in accuracy.
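As a rough illustration of points (3) and (4), the sketch below shows one way an online averaged perceptron (the abstract's "average perceptron") can be trained one example at a time while counting how often each feature is updated, and how those counts could then mask rarely updated features. It is a minimal sketch under stated assumptions, not the thesis's implementation: it classifies single tokens rather than labeling sequences, and the class name, the feature strings, the label set, the toy data and the `min_updates` threshold are all hypothetical.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multi-class averaged perceptron that tracks per-feature update counts."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)  # (feature, label) -> current weight
        self.totals = defaultdict(float)   # (feature, label) -> weight summed over steps
        self.stamps = defaultdict(int)     # (feature, label) -> step of last change
        self.updates = defaultdict(int)    # feature -> number of updates it took part in
        self.step = 0

    def score(self, features, label, weights=None):
        w = self.weights if weights is None else weights
        return sum(w.get((f, label), 0.0) for f in features)

    def predict(self, features, weights=None):
        # Ties go to the label listed first.
        return max(self.labels, key=lambda y: self.score(features, y, weights))

    def _bump(self, key, delta):
        # Lazily account for the steps during which this weight was unchanged.
        self.totals[key] += (self.step - self.stamps[key]) * self.weights[key]
        self.stamps[key] = self.step
        self.weights[key] += delta

    def update(self, features, gold):
        """Incremental training step: one example, no retraining from scratch."""
        self.step += 1
        guess = self.predict(features)
        if guess != gold:
            for f in features:
                self._bump((f, gold), 1.0)
                self._bump((f, guess), -1.0)
                self.updates[f] += 1

    def averaged(self, min_updates=0):
        """Averaged weights, keeping only features updated at least min_updates times."""
        if self.step == 0:
            return {}
        avg = {}
        for key, w in self.weights.items():
            if self.updates[key[0]] < min_updates:
                continue  # hypothetical masking rule: drop rarely updated features
            total = self.totals[key] + (self.step - self.stamps[key]) * w
            avg[key] = total / self.step
        return avg


# Toy data (hypothetical): a word feature and a suffix feature per token.
data = [
    (["w=aspirin", "suf=in"], "DRUG"),
    (["w=heparin", "suf=in"], "DRUG"),
    (["w=within", "suf=in"], "O"),
    (["w=gastritis", "suf=is"], "DISEASE"),
]

model = AveragedPerceptron(["O", "DISEASE", "DRUG"])
for _ in range(3):                 # three passes over the toy data
    for feats, gold in data:
        model.update(feats, gold)

full = model.averaged()                 # all learned weights
small = model.averaged(min_updates=2)   # compressed weight table
print(len(full), len(small))
```

Because each call to `update` touches only the weights of the features in one example, newly annotated text from a later iteration can be fed to the same model without starting over. The threshold in `averaged` trades model size against accuracy, so in practice it would need to be tuned on held-out data.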
Keywords/Search Tags:Internet resources, iterative framework, named entity recognition, average perceptron, model compression