| A Medical Knowledge Graph (MKG), which integrates medical data with domain knowledge, is the cornerstone of intelligent medical application systems. Existing Chinese MKGs have two major problems. Firstly, their underlying data draw mainly on a single type of source, medical website data, and include no clinical data, so the resulting MKGs cannot support clinical use. Secondly, the quality of MKGs still needs to be improved because of noisy data such as missing information and errors. To solve these two problems, this thesis proposes an approach for constructing and refining a large Chinese MKG, aiming at a comprehensive and accurate MKG. The main work of the thesis is as follows: (1) To address the problem of a single medical data source, this thesis proposes an approach for constructing an MKG from multi-source medical data. Firstly, five types of medical data were collected: hospital medical record data, raw data obtained from medical websites, alias information from Baidu Encyclopedia, an open-source COVID-19 knowledge graph, and an ICD-10 code correspondence table. For knowledge extraction, a BERT-BiLSTM-CRF model is selected for named entity recognition. Entities and relationships are extracted from hospital medical record data through tens of thousands of professional annotations by medical personnel, followed by model training and prediction. Measured by precision and recall, the model's prediction performance reaches about 70%, and four types of entities are predicted: diseases, clinical symptoms, test items, and test results. Secondly, the multi-source medical data are integrated into an MKG and stored in Neo4j. The resulting MKG has 30,000 nodes and over 100,000 relationships, with 8 types of entity nodes, 7 types of relationships, and various attributes. (2) To address the noise problem of MKGs, this thesis proposes a refinement method to improve the knowledge graph. Firstly, we 
identify four types of data errors in the MKG: null node values, multivalued ICD attributes, redundant symbols, and data content errors. We then propose a set of mechanisms for detecting and correcting these four types of errors. In particular, for data content errors, we present a noise detection and correction method based on Word2vec similarity and the PageRank scores of external medical websites. PageRank is used to rank the medical websites that serve as external evidence, and Word2vec similarity assists in determining the confidence level of the data, so that the confidence combines webpage ranking and similarity values. Secondly, we identify two types of missing data in the Chinese MKG: the absence of disease information for the primary node and multiple empty attribute values. We then present an automated completion method based on external website PageRank, which searches for missing data according to the ranking of medical websites. The proposed error-detection method has corrected over 400 data errors, and the knowledge completion method has supplemented over 4,000 missing entities and nearly 20,000 missing relationships. (3) Finally, this thesis validates the Chinese MKG in terms of coverage and accuracy. Coverage is verified on randomly selected entities and relationships, with the ratio of the queryable quantity to the total number taken as the coverage. Accuracy is verified by randomly selecting entities and attributes and measuring their correctness. The entity coverage rate of the constructed Chinese MKG is 85%, the relationship coverage rate is 74%, the comprehensive coverage rate reaches 76.43%, and the accuracy rate is 75.95%. In summary, this thesis proposes a set of methods for constructing a high-quality Chinese MKG. The constructed MKG contains more than 40,000 nodes and more than 120,000 relationships, which provides a solid data 
foundation for the development of related applications in downstream medical fields. |
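
The coverage check described in part (3) can be illustrated with a minimal Python sketch. It assumes that a sampled item counts as covered when it can be looked up in the graph, and that a combined rate is obtained by pooling hits and totals across samples. The abstract does not state how the comprehensive coverage rate is computed, so the pooling rule, the function names `coverage` and `pooled_coverage`, and the toy data are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def coverage(sampled: Iterable[str], queryable: Set[str]) -> float:
    """Ratio of sampled items that can be queried in the graph to all sampled items."""
    items: List[str] = list(sampled)
    if not items:
        raise ValueError("sample must not be empty")
    hits = sum(1 for item in items if item in queryable)
    return hits / len(items)


def pooled_coverage(parts: Iterable[Tuple[int, int]]) -> float:
    """Combine several (hits, total) pairs into one rate by pooling counts.

    Assumption: pooling is only one possible way to define a combined rate.
    """
    pairs = list(parts)
    total = sum(t for _, t in pairs)
    if total == 0:
        raise ValueError("total sample size must be positive")
    return sum(h for h, _ in pairs) / total


# Hypothetical toy data: names present in the graph and a random sample to check.
graph_entities = {"influenza", "fever", "blood test"}
sampled_entities = ["influenza", "fever", "blood test", "migraine"]

entity_rate = coverage(sampled_entities, graph_entities)  # 3 of 4 sampled items found
combined_rate = pooled_coverage([(3, 4), (1, 2)])  # entity sample plus a relation sample
```

In practice the membership test would be replaced by a lookup against the stored graph, and the sample sizes would need to be large enough for the reported rates to be stable.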