
Study On Storage And Mining For Clinical And Omics Big Data Of Tumor And Cardiovascular Disease

Posted on: 2016-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: W Li
GTID: 2334330536467653
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of medical information technology and biotechnology, the volume of data in the biological and medical fields is growing explosively. Clinical and omics data, the most important part of biomedical data, include electronic medical records, inspection reports, medical images, signal data, gene sequence data and so on, all produced during diagnosis, treatment and omics analysis. In some key hospitals the accumulated clinical data has reached hundreds of terabytes, and the volume in omics research has reached petabytes. Applying this data has great potential for research on the rules of disease development and for improving diagnosis and treatment. However, as the data keeps accumulating and its applications become more complicated, the storage and processing of clinical and omics data face many new problems, which seriously limit practical application.

To develop a suitable storage method for medical big data, this thesis takes tumor and cardiovascular disease, which do great harm to human health, as an example. It analyzes in depth all kinds of data produced during diagnosis, treatment and prognosis, and puts forward solutions for integrated storage of multi-source heterogeneous data, high-speed parallel access, and efficient data mining algorithms.

First, based on a thorough analysis of the composition and technical characteristics of clinical and omics data, the scattered, fragmented and mixed data were divided into three categories: document data, small binary files and large binary files. A system called MSPM (Medical Storage Platform for Mining), based on NoSQL and MapReduce, was built for highly scalable data storage and parallel mining. MSPM provides integrated storage of heterogeneous data, access through uniform rules, and diversified queries; more importantly, it is suitable for data analysis and mining. High scalability and high reliability are achieved with the data sharding and replication mechanisms of NoSQL.

Then, optimization work was done to overcome the performance bottlenecks of MSPM. To resolve the imbalance caused by automatic data sharding, an improved strategy based on the FDO-DT algorithm is presented. Load balancing between nodes is achieved effectively by considering the selection of the key, the number of chunks, and the frequency of operations on chunks, which improves the concurrent read and write performance of the cluster. To solve the poor performance caused by frequent access to large files, a feature library for large files was designed. It is built from key information extracted from medical documents, extracted metadata, and dynamically captured mining results, so that requests are served from the library instead of accessing the large files directly. This significantly reduces the total access overhead.

Finally, applying the traditional Apriori algorithm to medical big data suffers from complex data types, slow execution, poor targeting and similar problems. To address them, an improved algorithm, Apriori-M-DB, was designed and implemented. Mining of complex data types is made possible by storing the data as key-value pairs, which gives it a unified form. Apriori is then executed in parallel through MapReduce. Lastly, by generating all candidate sets non-recursively and counting only the candidate sets of interest, the algorithm addresses the low speed, high overhead and poor effectiveness of Apriori on medical data.
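The last step of the abstract (key-value items, a map and a reduce phase, non-recursive candidate generation, counting only candidates of interest) can be illustrated with a minimal single-process sketch. This is not the thesis's Apriori-M-DB implementation: the function names, the `key=value` item encoding, the `interest` filter and the sample records are all hypothetical, chosen only to show the general idea under those assumptions.

```python
from collections import Counter
from itertools import combinations


def to_items(record):
    """Flatten a key-value record into a sorted tuple of 'key=value' items."""
    return tuple(sorted(f"{k}={v}" for k, v in record.items()))


def map_candidates(record, max_size, interest):
    """Map step (hypothetical): emit (itemset, 1) for every itemset of up to
    max_size items that contains at least one item of interest.
    Candidates are enumerated with combinations(), not recursively."""
    items = to_items(record)
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            if interest.intersection(candidate):
                yield candidate, 1


def reduce_counts(pairs, min_support):
    """Reduce step (hypothetical): sum the counts per itemset and keep
    only itemsets seen in at least min_support records."""
    counts = Counter()
    for candidate, n in pairs:
        counts[candidate] += n
    return {c: n for c, n in counts.items() if n >= min_support}


def frequent_itemsets(records, max_size, min_support, interest):
    """Run the map step over all records, then the reduce step."""
    pairs = (p for r in records for p in map_candidates(r, max_size, interest))
    return reduce_counts(pairs, min_support)


# Made-up records, for illustration only.
records = [
    {"diagnosis": "hypertension", "smoker": "yes", "sex": "M"},
    {"diagnosis": "hypertension", "smoker": "yes", "sex": "F"},
    {"diagnosis": "none", "smoker": "no", "sex": "M"},
]
result = frequent_itemsets(
    records, max_size=2, min_support=2, interest={"diagnosis=hypertension"}
)
```

In a real cluster the map and reduce functions would run as separate distributed tasks; here they are ordinary Python functions so the sketch stays self-contained.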
Keywords/Search Tags: Medical big data, Storage for mining, NoSQL, Apriori algorithm, Performance optimization