
A Study On The Recognition Of Biomedical Named Entity Based On Statistic

Posted on: 2007-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: S Qiu
GTID: 2178360218462592
Subject: Computer application technology
Abstract/Summary:
Named Entity Recognition (NER) in biomedical literature is currently one of the Natural Language Processing (NLP) research questions attracting international attention. NLP research has achieved remarkable success in a few fields, but it has achieved little in the biomedical domain. With the rapid development of biomedicine, new Named Entities (NEs) are emerging one after another. Irregular naming and new uses of old words have made Biomedical Named Entity Recognition (Bio-NER) a hard task, which to some degree hinders research in the biomedical domain. Among the many approaches to Bio-NER, Statistical Natural Language Processing (SNLP) is frequently used because its statistics-based methods do not require researchers to have deep professional knowledge of biomedicine. Among SNLP methods, the Hidden Markov Model (HMM) is widely applied because of its statistical properties.

The HMM is an important approach to building statistical models in modern speech recognition systems, and it can learn rules from a small amount of training data. Many international researchers have addressed Bio-NER using the HMM and its variants. Although they have made notable progress, none has reached the goal of approximating human performance, and many questions remain open. In China, moreover, research on Bio-NER is still at an early stage. Against this background, this thesis describes a study on constructing a statistical model for Bio-NER using an HMM. The study proceeds as follows:

1. The HMM is trained on an annotated corpus using statistics. By counting over the annotated data, the parameters of the HMM are obtained: the set of states (S), the output alphabet (K), the initial state probabilities (p), the state transition probabilities (A), and the symbol emission probabilities (B). Regular patterns of NEs are found by applying different methods in various experiments, and those patterns are then incorporated to form the set K. The probabilities are computed on this basis. To address the lack of sufficient data when computing the probabilities, linear interpolation is used for smoothing. The study also introduces the concept of Lexical Structure Similarity (LSS), which provides a measurable standard for comparing symbols.

2. The trained HMM is tested on a non-annotated corpus. A sentence from the non-annotated corpus is used as the input sequence of the HMM, and an output sequence is computed with the Viterbi algorithm, which yields the recognized Bio-NEs. When the input sequence is formed, different ways of dividing a sentence into words are applied in different experiments. The boundaries for dividing a sentence into a word sequence are determined by computing the similarity between a series of words in the sentence and each item in the set K, supplemented by a simple part-of-speech analysis.

3. The HMM is improved by calculating and comparing the Recall and Precision of the test results. The above procedures are repeated until an HMM that can effectively recognize Bio-NEs is obtained.

The study described above achieved notable results on Bio-NER, and the effectiveness of the algorithm is verified.
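
The three steps above can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy corpus, the two-tag scheme ("NE" for entity tokens, "O" for everything else), the interpolation weight lam=0.9, and all function names are assumptions made for illustration. Transitions are interpolated with tag unigram frequencies and emissions with a uniform distribution, which is one common form of linear interpolation; the LSS measure, the pattern-based K set, and the similarity-based word segmentation are not modeled here.

```python
import math
from collections import Counter

def train_hmm(tagged_sentences, lam=0.9):
    """Step 1: estimate p, A, B by counting, smoothed by linear interpolation."""
    states = sorted({tag for sent in tagged_sentences for _, tag in sent})
    vocab = {word for sent in tagged_sentences for word, _ in sent}
    init, trans, emit, tag_count, out_count = (Counter() for _ in range(5))
    for sent in tagged_sentences:
        init[sent[0][1]] += 1
        for i, (word, tag) in enumerate(sent):
            tag_count[tag] += 1
            emit[tag, word] += 1
            if i > 0:
                prev = sent[i - 1][1]
                trans[prev, tag] += 1
                out_count[prev] += 1
    n_sent, n_tok = len(tagged_sentences), sum(tag_count.values())

    def p(s):  # initial state probability, interpolated with a uniform prior
        return lam * init[s] / n_sent + (1 - lam) / len(states)

    def a(s1, s2):  # transition probability, interpolated with tag unigrams
        ml = trans[s1, s2] / out_count[s1] if out_count[s1] else 0.0
        return lam * ml + (1 - lam) * tag_count[s2] / n_tok

    def b(s, word):  # emission probability; the +1 reserves mass for unseen words
        return lam * emit[s, word] / tag_count[s] + (1 - lam) / (len(vocab) + 1)

    return states, p, a, b

def viterbi(words, states, p, a, b):
    """Step 2: most probable state sequence for `words`, computed in log space."""
    score = [{s: math.log(p(s)) + math.log(b(s, words[0])) for s in states}]
    back = []
    for word in words[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(a(r, s)))
            cur[s] = prev[best] + math.log(a(best, s)) + math.log(b(s, word))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):  # follow back-pointers from the last word
        path.append(ptr[path[-1]])
    return path[::-1]

def entity_spans(tags):
    """Maximal runs of 'NE' tags as (start, end) index pairs, end exclusive."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag == "NE" and start is None:
            start = i
        elif tag != "NE" and start is not None:
            spans.add((start, i))
            start = None
    return spans

def precision_recall(gold, pred):
    """Step 3: entity-level Precision and Recall with exact span matching."""
    g, q = entity_spans(gold), entity_spans(pred)
    tp = len(g & q)
    return (tp / len(q) if q else 0.0, tp / len(g) if g else 0.0)

# Hypothetical toy annotated corpus (not data from the thesis).
corpus = [
    [("the", "O"), ("IL-2", "NE"), ("gene", "NE"), ("is", "O"), ("expressed", "O")],
    [("NF-kappa", "NE"), ("B", "NE"), ("binds", "O"), ("the", "O"), ("promoter", "O")],
    [("the", "O"), ("protein", "O"), ("binds", "O"), ("IL-2", "NE")],
]
states, p, a, b = train_hmm(corpus)
predicted = viterbi("the IL-2 gene binds the promoter".split(), states, p, a, b)
print(predicted)
```

Because every smoothed probability is strictly positive when lam is below 1, the logarithms in the decoder are always defined, including for words that never appeared in training. In an iterative setup like the one described in step 3, the Precision and Recall values would guide changes to the model between runs.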
Keywords/Search Tags: Statistical Natural Language Processing (SNLP), Biomedical Named Entity Recognition (Bio-NER), corpus, Hidden Markov Model (HMM), Viterbi algorithm, smoothing techniques