
Research on Unsupervised Narrow-domain Entity Recognition Method for Biomedical Literature

Posted on: 2019-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: G C Dong
Full Text: PDF
GTID: 2428330593950458
Subject: Control Science and Engineering
Abstract/Summary:
Biomedical document mining systems are indispensable tools for research in the biomedical field, and named entity recognition is a crucial part of text mining technology for biomedical literature. At present, machine learning methods are the focus of entity recognition research. These methods mainly target entity types such as genes, proteins and diseases, which have large entity sets and large training corpora. However, as formal knowledge bases such as knowledge graphs are applied ever more deeply and concretely in various fields of intelligent application, entity recognition in subdivided domains, i.e., narrow-domain entity recognition, is becoming increasingly important. Because a narrow domain has a limited number of entities and relatively few text samples, unsupervised entity recognition methods based on deep learning or on external knowledge sources cannot exploit their advantages fully and do not achieve satisfactory recognition results. This thesis is carried out against that background and obtains the following results:

1. To apply the context information of candidate entities to unsupervised narrow-domain entity recognition, a short text similarity calculation based on LDA (latent Dirichlet allocation) is proposed. The algorithm introduces an LDA-based feature word screening mechanism, which uses the short-text topic information obtained by LDA to extract feature words from short texts and so reduce interference from words that are irrelevant to the semantics of the short text. It also includes a semantic weight learning mechanism based on PSO (particle swarm optimization), which exploits PSO's efficient parameter optimization, good robustness and fast search to set the weight of each feature word according to its degree of semantic contribution. On this basis, a new short text representation model, "Word Embedding+LDA+PSO", is constructed to address the insufficient semantic representation of existing short text representation models.

2. A named entity feature extraction method based on short-text semantic distance measurement is proposed to extract the domain features of candidate entities for narrow-domain entity recognition. The algorithm combines linguistic and statistical features of the candidate entities: it uses the number of times a candidate entity's context occurs in the corpus to characterize the distribution of the candidate entity in the corpus. That number is determined by computing the semantic distance between the contexts of the candidate entity and the contexts in the corpus. Compared with traditional statistics-based feature extraction, this algorithm integrates the semantic information of the candidate entities, so the features it obtains are more stable and have stronger characterization capability.

3. An unsupervised method for narrow-domain entity recognition is proposed that fuses domain relevance measurement with the word features of the context. The algorithm uses the short text similarity calculation and the named entity feature extraction method described above to design a new term-corpus hypothesis, which uses the log likelihood ratio to measure the difference of candidate entities in the corpus. In addition, based on the relative proportions of the central words of the candidate entities in the corpus, a domain dependence function is constructed to measure the tendency of the candidate entities in the corpus. The log likelihood ratio and the domain dependence function are then integrated into a domain relevance function, which realizes narrow-domain entity recognition.
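As an illustration of the log likelihood ratio mentioned in the third result, the following is a minimal sketch of the standard Dunning-style G² statistic for a 2x2 term/corpus contingency table. It is not the thesis's own term-corpus hypothesis, whose exact form the abstract does not give. The function names, the use of a separate reference corpus, and the raw-count inputs are assumptions made for this example.

```python
import math


def _term(observed: float, expected: float) -> float:
    """One cell's contribution, observed * ln(observed / expected); 0 for empty cells."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood_ratio(k_domain: int, n_domain: int, k_ref: int, n_ref: int) -> float:
    """G^2 statistic for a candidate seen k_domain times in n_domain tokens of a
    domain corpus and k_ref times in n_ref tokens of a reference corpus.

    Both corpora and all four counts are hypothetical inputs for this sketch.
    """
    total_k = k_domain + k_ref          # occurrences of the candidate overall
    total_n = n_domain + n_ref          # tokens overall
    total_other = total_n - total_k     # tokens that are not the candidate

    # Observed counts paired with the counts expected under the null hypothesis
    # that the candidate is equally frequent in both corpora.
    cells = [
        (k_domain, n_domain * total_k / total_n),
        (n_domain - k_domain, n_domain * total_other / total_n),
        (k_ref, n_ref * total_k / total_n),
        (n_ref - k_ref, n_ref * total_other / total_n),
    ]
    return 2.0 * sum(_term(obs, exp) for obs, exp in cells)


if __name__ == "__main__":
    # A candidate that is much more frequent in the domain corpus scores high;
    # one with the same relative frequency in both corpora scores zero.
    print(log_likelihood_ratio(50, 1000, 5, 1000))
    print(log_likelihood_ratio(10, 1000, 10, 1000))
```

The statistic is unsigned: it measures how strongly the two relative frequencies differ, not which corpus the candidate favors. That may be why the abstract pairs it with a separate domain dependence function.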
Keywords/Search Tags: biomedical literature mining, narrow-domain entity recognition, short text similarity, domain relevance measurement, log likelihood ratio