As an interdisciplinary field, biomedicine is developing rapidly. Consequently, both the body of specialized knowledge and the volume of biomedical literature are growing at an extremely high rate. This large body of text may contain a great deal of valuable information and data. The ultimate goal of big-data biomedical text mining is to extract that useful information and provide it to researchers. Biomedical named entity recognition is a critical step in biomedical text mining. Traditional centralized methods for biomedical named entity recognition struggle to process massive data, so in this research we use distributed processing on Hadoop both to train the recognition model and to process massive data. Our research can be divided into the following three parts: (1) Parameter training of the Hidden Markov Model (HMM) based on MapReduce. In this study we implement the HMM parameter training procedure with MapReduce and obtain the three parameters (the initial state probability distribution, the transition probability matrix, and the emission probability matrix) by computing the initial state distribution, the transitions between pairs of states, and the words emitted by each state. We compare Conditional Random Fields (CRF) with the HMM to evaluate the HMM's parameter training efficiency and its biomedical named entity recognition performance on Hadoop. For the CRF, we compute the gradient vector of the feature function weights in parallel on Hadoop and iterate to the optimal CRF parameters. The comparison of the two models on Hadoop shows that, with the same training data, CRF-based named entity recognition performs slightly better than HMM-based recognition, but as the training data set keeps growing, HMM training is much more efficient than CRF training. We therefore choose the HMM to recognize biomedical named entities in massive data on Hadoop. (2) HMM-based biomedical named entity recognition on Hadoop. Named entity recognition on Hadoop is divided into two MapReduce procedures. The first performs data cleaning: it removes interfering and useless data and generates a new test corpus. In the second, the Map stage performs sentence segmentation, word segmentation, and part-of-speech tagging, and passes the part-of-speech-tagged sentences to the Reduce stage. In the Reduce stage, we use the Viterbi algorithm to tag each sentence with biomedical named entities according to the HMM trained in (1), finally obtaining sentences annotated with biomedical named entities. The experimental results show that named entity recognition on Hadoop is far more efficient than the standalone approach and saves a great deal of time.
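
The two core steps described above, count-based HMM parameter estimation in a map/reduce style and Viterbi decoding with the trained parameters, can be illustrated with a minimal sketch. It is not the implementation used in this work: the map and reduce functions are plain Python functions run in a single process rather than Hadoop jobs, the tag set (`B` for an entity token, `O` for any other token), the toy corpus, and the `floor` probability for unseen events are all assumptions made for illustration.

```python
import math
from collections import Counter

def map_sentence(tagged):
    """Map step: emit (key, 1) pairs for one sentence given as [(word, state), ...]."""
    if not tagged:
        return
    yield ("init", tagged[0][1]), 1
    for i, (word, state) in enumerate(tagged):
        yield ("emit", state, word), 1
        if i > 0:
            yield ("trans", tagged[i - 1][1], state), 1

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def normalize(totals):
    """Convert summed counts into relative frequencies: the three HMM parameters."""
    init, trans, emit = {}, {}, {}
    for key, count in totals.items():
        if key[0] == "init":
            init[key[1]] = count
        elif key[0] == "trans":
            trans.setdefault(key[1], {})[key[2]] = count
        else:
            emit.setdefault(key[1], {})[key[2]] = count

    def scale(row):
        total = sum(row.values())
        return {k: v / total for k, v in row.items()}

    return (scale(init),
            {s: scale(r) for s, r in trans.items()},
            {s: scale(r) for s, r in emit.items()})

def viterbi(words, states, init, trans, emit, floor=1e-6):
    """Return the most probable state sequence for a non-empty word list.

    Works in log space; `floor` is an assumed probability for unseen events.
    """
    def lp(p):
        return math.log(p if p > 0 else floor)

    score = {s: lp(init.get(s, 0)) + lp(emit.get(s, {}).get(words[0], 0))
             for s in states}
    back = []
    for word in words[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + lp(trans.get(r, {}).get(s, 0)))
            score[s] = (prev[best] + lp(trans.get(best, {}).get(s, 0))
                        + lp(emit.get(s, {}).get(word, 0)))
            pointers[s] = best
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy corpus: B marks an entity token, O any other token.
corpus = [
    [("the", "O"), ("BRCA1", "B"), ("gene", "O")],
    [("mutations", "O"), ("in", "O"), ("TP53", "B")],
    [("TP53", "B"), ("regulates", "O"), ("BRCA1", "B")],
]
pairs = [pair for sentence in corpus for pair in map_sentence(sentence)]
init, trans, emit = normalize(reduce_counts(pairs))
tags = viterbi(["the", "TP53", "gene"], ["O", "B"], init, trans, emit)
```

Because every emitted pair is an independent count, the map step can run on separate corpus splits and the reduce step only has to sum, which is what makes this kind of training straightforward to distribute. Decoding is independent per sentence, so it parallelizes in the same way.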