The Recognition Of Protein Name In The Biomedical Documents

Posted on: 2007-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: G Li
Full Text: PDF
GTID: 2120360182460897
Subject: Operational Research and Cybernetics
Abstract/Summary:
In recent years, the Human Genome Project and advances in molecular biology and information science have driven unprecedented growth in biological data on DNA, RNA, and proteins, and large volumes of functional genomics and proteomics data have also begun to emerge. The biomedical literature is expanding rapidly as well. Data is not the same as knowledge, however: it is the source of information and knowledge, and much important information lies hidden behind this sharply growing body of data. The rapid increase in machine-readable biomedical text makes automatic information extraction from that text increasingly attractive. In particular, extracting protein-protein interaction information from MEDLINE abstracts is regarded as one of the most important tasks today. To extract information about proteins, one must first recognize protein names in the text. This problem has been studied in natural language processing as the named entity recognition task.

At present, entity recognition methods fall mainly into three kinds: methods based on hand-crafted rules, dictionary-based approaches, and machine learning techniques. Several research efforts use machine learning to recognize biological entities in text; one drawback is that they do not provide identification (ID) information for the recognized terms. Dictionary-based approaches do provide ID information, because they recognize a term by searching the dictionary for the entry most similar to the target term. However, dictionary-based approaches have two serious problems: spelling variation and short names.

In this paper we propose a two-phase protein name recognition method. In the first phase, we scan the text for protein name candidates using a protein name dictionary and an approximate string search technique. We also introduce the Dice coefficient and a first-word computation method, which address the problem of altered word order within a protein name and improve recall. In the second phase, we filter the candidates using a machine learning technique. The experimental results indicate that the improvement is effective.
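
The first phase described above (dictionary lookup with approximate string search, plus a Dice coefficient to tolerate reordered words) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not say how the Dice coefficient is computed or which thresholds are used, so the word-level Dice definition, the function names (`edit_distance`, `dice_coefficient`, `find_candidates`), the thresholds `max_edit_ratio` and `min_dice`, and the toy dictionary with its IDs are all assumptions made for illustration. The first-word computation method is not described in the abstract and is not modelled here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over word sets: 2*|A & B| / (|A| + |B|).

    Assumption: words are compared as sets, so word order is ignored.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return 2 * len(wa & wb) / (len(wa) + len(wb))


def find_candidates(phrase, dictionary, max_edit_ratio=0.2, min_dice=0.8):
    """Return (protein_id, name) pairs whose name approximately matches phrase.

    A dictionary entry is kept if its spelling is close (edit distance) or
    if it shares most of its words with the phrase (Dice coefficient).
    Both thresholds are illustrative values, not taken from the thesis.
    """
    hits = []
    p = phrase.lower()
    for protein_id, name in dictionary.items():
        n = name.lower()
        close_spelling = edit_distance(p, n) <= max_edit_ratio * max(len(p), len(n))
        same_words = dice_coefficient(p, n) >= min_dice
        if close_spelling or same_words:
            hits.append((protein_id, name))
    return hits


# Hypothetical toy dictionary mapping made-up IDs to protein names.
TOY_DICTIONARY = {
    "P001": "interleukin 2 receptor",
    "P002": "tumor necrosis factor alpha",
}

if __name__ == "__main__":
    # Spelling variation: hyphen instead of space.
    print(find_candidates("interleukin-2 receptor", TOY_DICTIONARY))
    # Altered word order: caught by the Dice coefficient, not by edit distance.
    print(find_candidates("receptor interleukin 2", TOY_DICTIONARY))
```

Because each hit carries its dictionary ID, the candidates keep the identification information that, according to the abstract, pure machine learning recognizers lack. In the two-phase design, a trained classifier would then filter these candidates; that step is omitted here because the abstract gives no details about the classifier or its features.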
Keywords/Search Tags: Entity Identification, Protein Name Identification, Candidates, Edit-distance, Classifier