
Research On Core Drug Identification Of Chinese Medicine Formula Based On Knowledge Discovery

Posted on: 2022-01-11  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y Zhang  Full Text: PDF
GTID: 1484306524971219  Subject: Software engineering
Abstract/Summary:
The Chinese medicine formula is the main method by which traditional Chinese medicine treats diseases. Since formulae were first documented, many have been recorded in Chinese medicine books and literature. A formula is composed of Chinese drugs according to the compatibility principle. The "Jun" and "Chen" drugs are considered the core drugs: they play the key role in treatment by addressing the patient's main syndromes and symptoms. Clarifying the core drugs for a disease helps to uncover the compatibility rules of formulae, to discover the key drugs for treating the disease, and to assist doctors in composing more rational and effective formulae. This dissertation designs knowledge discovery models for core drug identification in Chinese medicine formulae: models based on community detection algorithms analyze structured formula data, and models based on Chinese word embedding analyze unstructured literature data. The main research work is as follows:

1) Community detection can mine sets of nodes with similar attributes and find important nodes in networks. Core drugs can be regarded as the important drugs within a collection of drugs that have the same or similar efficacies. By constructing the relationships among drugs as a drug network, drug communities and important drugs can be detected in that network, which realizes core drug identification. Two community detection algorithms are proposed, the Whale Optimization based Community Detection Algorithm (WOCDA) and the Node Ability based Label Propagation Algorithm (NALPA), which lay the research foundation for structured formula data analysis. WOCDA imitates the hunting behavior of humpback whales: a new initialization strategy and three operations (shrinking encircling, spiral updating, and random searching) are designed to optimize the modularity density and realize community detection. In NALPA, inspired by human society, a node's propagation ability, attraction ability, launch ability, and acceptance ability are designed to measure the importance and influence scope of nodes. Inspired by the radar transmission mode, label importance is designed to measure how a label's weight changes as it spreads to other nodes, and a new label propagation process is designed to address the instability of label propagation based community detection algorithms. Experimental results show that, on both synthetic and real-world networks, the two algorithms detect communities of higher quality than the comparison algorithms.

2) For core drug identification from structured formula data, this dissertation proposes two models: the Core Drug Identification model based on community detection with Label Weight (CDILW) and the Core Drug Identification model based on community detection with Graph Layout (CDIGL). Drugs are modeled as nodes; if two drugs treat the same syndrome and symptom, an edge is established between them, and a drug network is constructed. The models for structured formula data consist of two stages: drug community detection and core drug recognition. In the first stage, drug communities for different syndromes are detected on the drug network. In the first stage of CDILW, based on the force-directed network layout, the attractive force between nodes is designed to represent drug similarity, and label importance is defined by combining node attraction and node importance to represent the importance of different drug efficacies. The update of label weight during label propagation is also considered, which improves the stability of drug community detection. In the first stage of CDIGL, based on the (a,r)-energy model, the network is first drawn in a compact layout, and a node position initialization strategy is proposed. The network is then drawn in a uniform layout. Based on the node attraction in the uniform layout, dynamic node importance and label importance are designed to represent drug importance and the importance of different drug efficacies; this fuses graph layout with community detection and improves the stability of drug community detection. In the second stage, the nodes with higher degree in the drug communities are regarded as core drugs. The experimental results show that both models can detect the core drugs for different syndromes, which demonstrates their effectiveness in identifying core drugs from structured formula data.

3) Formula and drug information is mainly recorded in literature. A Chinese word embedding model can analyze the semantics of words from their context, capture the properties, efficacies, and indications of Chinese drugs, and generate semantic word embeddings of drugs. These embeddings are used to calculate drug similarity and construct a drug semantic network, on which drug communities and core drugs are then detected. For drug semantic analysis, two Chinese word embedding models are proposed to lay the research foundation for unstructured literature data analysis: the stroke, structure and pinyin feature substrings based Chinese word embedding model (ssp2vec) and the Syntax, Word cO-occuRrence and Inner-character Similarity based Chinese word embedding model (SWORIS). In ssp2vec, the feature substring is designed to combine the stroke, structure, and pinyin features of Chinese words and to consider their relevance, so that the context of a Chinese word can be predicted and its semantic representation learned. In SWORIS, a symmetrical convolutional autoencoder is designed to analyze Chinese character images and extract latent stroke and structure features. The similarity of Chinese words is calculated, and the syntax, co-occurrence, and similar contexts of Chinese words are then preserved in a graph network. A probability random walk based sampling strategy is proposed to generate graph context; SWORIS then predicts the graph context from the target word to learn the semantic representation of Chinese words. Experimental results show that the two Chinese word embedding models are superior to the comparison models.

4) For core drug identification from unstructured literature data, this dissertation proposes two models: the Core Drug Identification model based on Chinese word embedding with Ensemble Feature (CDIEF) and the Core Drug Identification model based on Chinese word embedding with Feature Probability (CDIFP). Literature related to the treatment of the target disease is retrieved and preprocessed to establish a disease corpus. The models for unstructured literature data consist of four stages: drug word embedding learning, drug semantic network construction, semantic network community detection, and core drug recognition. In the first stage, the semantic word embeddings of drugs are learned from the latent domain knowledge in the disease corpus. In the first stage of CDIEF, the stroke n-grams, structure, and pinyin of drug words are integrated, and the contextual words are predicted from the integrated features of the target word to analyze the semantics of Chinese drugs, which are then represented as semantic embeddings. In the first stage of CDIFP, to handle the polysemy of Chinese words, each word is represented as multiple Gaussian distributions combined with feature substrings. The resulting feature probability distribution captures the different semantic aspects of a word, and thereby the meanings of a drug in treating different syndromes. A similarity based objective function is then optimized to learn the semantic representation of Chinese words. In the second stage, drugs are regarded as nodes, and edges are constructed between drugs with high similarity to establish the drug semantic network. In the third stage, drug communities for different syndromes are discovered by community detection. In the fourth stage, the nodes with higher degree in the drug communities are regarded as core drugs. The experimental results show that both models can detect the core drugs for different syndromes, which illustrates their effectiveness in identifying core drugs from unstructured literature data.
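
The shared shape of the pipelines described above (build a drug network, detect drug communities, take the higher-degree nodes as core drugs) can be illustrated with a minimal sketch. It is not the dissertation's method: the community step here is a plain degree-weighted label propagation used as a stand-in for NALPA, CDILW, or CDIGL, and every name in it (`INDICATIONS`, `drug_a` to `drug_h`, `s1` to `s5`, `build_drug_network`, `detect_communities`, `core_drugs`) is a hypothetical placeholder, not data or an API from the source.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: placeholder drugs mapped to placeholder
# syndrome/symptom identifiers. Not taken from the dissertation.
INDICATIONS = {
    "drug_a": {"s1", "s2"},
    "drug_b": {"s1"},
    "drug_c": {"s1"},
    "drug_d": {"s2", "s5"},
    "drug_e": {"s3", "s4"},
    "drug_f": {"s3"},
    "drug_g": {"s3"},
    "drug_h": {"s4", "s5"},
}


def build_drug_network(indications):
    """Link two drugs whenever they share at least one indication."""
    graph = {drug: set() for drug in indications}
    for u, v in combinations(sorted(indications), 2):
        if indications[u] & indications[v]:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def detect_communities(graph, max_rounds=100):
    """Simplified label propagation (an assumption, not NALPA).

    Each node adopts the label with the largest total neighbor degree;
    ties go to the lexicographically smallest label, and nodes are
    visited in sorted order, so the result is deterministic.
    """
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated nodes keep their own label
            scores = defaultdict(int)
            for nbr in graph[node]:
                scores[labels[nbr]] += len(graph[nbr])
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return list(communities.values())


def core_drugs(graph, community, top_k=1):
    """Rank drugs by their degree inside the community; keep the top_k."""
    ranked = sorted(community, key=lambda d: (-len(graph[d] & community), d))
    return ranked[:top_k]


if __name__ == "__main__":
    network = build_drug_network(INDICATIONS)
    for community in detect_communities(network):
        print(sorted(community), "core:", core_drugs(network, community))
```

On this toy input the sketch separates the two densely linked groups and selects `drug_a` and `drug_e`, the nodes with the most links inside their groups. For the unstructured pipeline, only `build_drug_network` would change: edges would come from a similarity threshold on drug embeddings instead of shared indications.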
Keywords/Search Tags:Chinese medicine formula, core drug, knowledge discovery, community detection, Chinese word embedding