
The Research Of Parallel Algorithm For Named Entity Recognition Based On Biomedical Literature Data

Posted on: 2016-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: L G Jiang
Full Text: PDF
GTID: 2428330473964954
Subject: Computer Science and Technology
Abstract/Summary:
With electronic text now used everywhere in an era of information explosion, the sheer volume of data makes it hard to obtain useful information and relevant knowledge efficiently, especially from scientific literature, which records a large number of results and experimental findings in each field. Text mining, one of the research focuses in this area, can extract relevant knowledge from large bodies of literature quickly and efficiently. It is an interdisciplinary field related to natural language processing, information retrieval, information extraction, data mining, computational linguistics, machine translation, chunking, and so on. As an important foundation of text mining, named entity recognition aims to locate atomic elements of special significance in text and classify them into predefined categories.

Because named entities in the biomedical field are unusually specialized and complex, named entity recognition on biomedical literature faces a dilemma between accuracy and efficiency. Machine learning strategies, rich feature sets, and related methods are now relatively mature and have improved the accuracy of biomedical named entity recognition considerably. The efficiency problem, however, has become more and more prominent. On large-scale literature data sets, biomedical named entity recognition takes a very long time in a single-machine environment, because the computing time for model training and inference grows nonlinearly. Research on improving efficiency is therefore of great value for the development of text mining technology. It gives scientists in the field an efficient research tool and lets them concentrate on higher-level work. This study also offers some guidance for similar work in related fields.

Based on a summary and analysis of biomedical named entity recognition methods, and targeting the low performance of traditional single-machine training of Conditional Random Fields (CRFs) on large-scale data sets, we propose CRFs-L-MapReduce, a parallel optimization algorithm for the training stage of the CRFs model built on the second-generation Hadoop platform. It optimizes the parameter estimation procedure of the CRFs training algorithm and speeds up the training process of CRFs-based named entity recognition. Experiments show that, on large-scale biomedical training data, CRFs-L-MapReduce converges faster than the traditional stand-alone CRFs training algorithm, with efficiency about 4.4 to 7.4 times that of the stand-alone algorithm. Its training efficiency also increases as cluster performance improves, which demonstrates good scalability.

Furthermore, after a detailed analysis of the workflow of the CRFs model inference algorithm, and drawing on currently popular big data processing technology, we propose CRFs-V-Spark, a parallel optimization algorithm for model inference based on in-memory computing. Thanks to its advantages in iterative and in-memory computation, Spark can schedule complex computing tasks automatically and avoids both disk reads and writes of intermediate results and repeated resource requests, which makes it well suited to data mining and machine learning in the big data era. CRFs-V-Spark is compatible with Hadoop clusters and uses memory resources flexibly to handle huge amounts of data efficiently. Experiments show that the recognition time of CRFs-V-Spark is far lower than that of the single-machine CRFs inference algorithm, with recognition efficiency increased by about 5 to 9 times, and that its performance can be improved further as more memory becomes available. CRFs-V-Spark moves the CRFs inference algorithm into memory and parallelizes it, greatly improves the real-time performance of data processing, and further enhances the efficiency of biomedical named entity recognition.
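
The data-parallel idea behind CRFs inference can be illustrated with a small sketch. The abstract does not describe how CRFs-V-Spark is implemented, so the code below is not the thesis's algorithm. It only assumes that inference means Viterbi decoding of a linear-chain CRF and that each sentence can be decoded independently once the model is trained. The names `viterbi` and `decode_parallel` are hypothetical, the scores are made-up numbers, and a local thread pool stands in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for one sentence.

    emissions[t][y]   : score of label y at token position t (illustrative values)
    transitions[p][y] : score of moving from label p to label y
    """
    if not emissions:
        return []
    n_labels = len(transitions)
    scores = list(emissions[0])   # best score of any path ending in each label
    backptrs = []                 # backptrs[t-1][y] = best previous label for y at t
    for t in range(1, len(emissions)):
        new_scores, ptrs = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda p: scores[p] + transitions[p][y])
            new_scores.append(scores[best_prev] + transitions[best_prev][y]
                              + emissions[t][y])
            ptrs.append(best_prev)
        scores = new_scores
        backptrs.append(ptrs)
    # Backtrack from the best final label to recover the full path.
    best = max(range(n_labels), key=lambda y: scores[y])
    path = [best]
    for ptrs in reversed(backptrs):
        best = ptrs[best]
        path.append(best)
    path.reverse()
    return path


def decode_parallel(sentences, transitions, workers=4):
    """Decode many sentences concurrently; the trained model is shared read-only.

    Assumption: sentences are independent, so decoding is a pure map step.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: viterbi(e, transitions), sentences))


# Two labels: 0 = outside, 1 = entity. Switching labels is penalized.
transitions = [[0.0, -5.0], [-5.0, 0.0]]
sentences = [
    [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
    [[0.0, 3.0], [0.0, 3.0]],
]
labels = decode_parallel(sentences, transitions)
```

Because decoding one sentence never reads another sentence's state, the same per-sentence function could be applied as a map over a distributed collection of sentences, with the model parameters shared read-only across workers. That pattern is what makes inference straightforward to scale out.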
Keywords/Search Tags:Named Entity Recognition, Conditional Random Fields, Hadoop, Spark, Parallel Algorithm