
Research And Implementation Of Structured Processing Of Medical Text Data Based On Spark Platform

Posted on: 2018-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: X H Zhang
Full Text: PDF
GTID: 2348330536452499
Subject: Computer Science and Technology
Abstract/Summary:
Medical text has traditionally been structured by hand, relying on doctors' clinical experience. This manual approach is time-consuming and does not reach the accuracy expected of structured processing. In the era of big data, the growth of medical data poses new challenges for the medical industry: hospitals produce large volumes of medical text while diagnosing and treating patients, and the vast majority of that text is semi-structured or unstructured. Transforming it into structured data that a computer can recognize and interpret can enable advances in scientific research, clinical diagnosis and treatment, data sharing, and related areas.

Structured processing of medical text is defined as the transformation of semi-structured or unstructured medical text into structured text. Current approaches fall into two categories: pre-structuring and post-structuring. The former processes data through a dedicated system; the latter relies mainly on natural language processing. The aim is to extract each index name and its corresponding parameter value automatically. To this end, this thesis summarizes the structural and linguistic features of medical text and, on that basis, proposes a structuring method with three parts: text preprocessing, new word discovery, and information extraction. Text preprocessing cleans, integrates, transforms, and standardizes the text so that the data is consistent and accurate for the later steps. New word discovery finds medical terms in the text using word embeddings: Word2vec, Google's open-source word embedding tool, is trained on the medical text and maps each word into an n-dimensional vector space. New words are then identified according to the internal cohesion between words, information entropy, and word frequency, and are added to a user-defined lexicon. Information extraction designs extraction rules that, guided by the keywords found during new word discovery, extract the corresponding key information. Finally, the extracted information is organized into structured data.

This thesis deploys all three parts on the Spark platform and uses distributed computing to carry out the structured processing. To verify the feasibility of the method, a portion of the text data is randomly selected as a sample and structured manually to serve as the reference standard. The output of the proposed method is then compared against this standard to show that the method achieves the expected results.
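
As a rough illustration of the information extraction step described above, the following Python sketch pairs index names from a lexicon with the numeric value that follows them. It is not the thesis's implementation: the lexicon entries, the regular expression, the English sample note, and the names `LEXICON`, `build_pattern`, and `extract` are all assumptions made for this example, and the lexicon merely stands in for the output of new word discovery.

```python
import re

# Hypothetical user-defined lexicon of index names; in the thesis this
# would come from new word discovery, here it is hard-coded for illustration.
LEXICON = ["white blood cell count", "hemoglobin", "body temperature"]

# Assumed value rule: a number, then an optional unit made of letters,
# digits, and a few symbols (%, °, /, ^).
VALUE = r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%°/^0-9]*)"


def build_pattern(lexicon):
    """Compile one rule matching '<index name> [is|:|=] <value> [unit]'."""
    # Try longer names first so a long term wins over a shorter prefix.
    names = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    return re.compile(
        rf"(?P<name>{alternation})\s*(?:is|:|=)?\s*{VALUE}",
        re.IGNORECASE,
    )


def extract(text, lexicon=LEXICON):
    """Return one structured record per index name found in the text."""
    pattern = build_pattern(lexicon)
    return [
        {
            "index": match.group("name").lower(),
            "value": float(match.group("value")),
            "unit": match.group("unit"),
        }
        for match in pattern.finditer(text)
    ]


# Invented sample note, not data from the thesis.
note = ("Body temperature 38.5 C, hemoglobin: 112 g/L. "
        "White blood cell count is 11.2 10^9/L.")
records = extract(note)
for record in records:
    print(record)
```

Because `extract` is a pure function of a single text, it could in principle be mapped over a distributed collection of documents on a platform such as Spark; the sketch leaves that part out.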
Keywords/Search Tags: Medical Text Structure, Chinese Word Segmentation, Word Embedding, Information Entropy, Information Extraction