
Design And Implementation Of A New Germplasm Resources Data Warehouse System

Posted on: 2019-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: J N Jiang
Full Text: PDF
GTID: 2428330542999226
Subject: Control Science and Engineering
Abstract/Summary:
As a component of biological resources and biodiversity, plant germplasm resources play an important role in society. They contribute to food security and ecological security, and they are essential to the sustainable development of agriculture. As one of the most biodiverse countries in the world, China holds abundant plant germplasm resources in both variety and scale. With government support and the hard work of agricultural researchers, the informatization of germplasm resources has been under way, and a database holding germplasm data has been built and is now in public service. However, as this work deepens, germplasm data keep growing while the value hidden in the data remains difficult to mine. In the era of big data, it is essential to bring big data technology into agriculture so that the storage and sharing of germplasm data can be guaranteed and the value of the data can be revealed. Based on Hadoop technology, in particular Apache Spark and Hive, a new data warehouse system is built for mining germplasm data. The key research points of the thesis are as follows.

Firstly, in constructing the data warehouse system, many germplasm materials need to be classified according to their quality. An improved K-means algorithm, based on stacked sparse autoencoder neural networks and quotient space theory, is proposed to cluster germplasm materials. The data are then labelled so that newly added materials can be classified automatically. Because germplasm data are high-dimensional, feature reduction is introduced into data processing; with the extracted features, clustering can be more accurate and less time-consuming. By taking the mixed feature data from the stacked sparse autoencoder neural networks as the initial cluster centers, the algorithm overcomes the sensitivity of K-means to the choice of starting points. Compared with the traditional approach of using PCA for dimension reduction, the algorithm performs better at clustering high-dimensional data.

Secondly, with the deepening informatization of germplasm resources, germplasm data keep growing in size and variety, while the utilization ratio of the data is low. A data warehouse system is built on Hadoop technology, in particular Apache Spark and Hive. The design and implementation of the system are described in detail in the thesis. Compared with traditional systems based on relational databases, the data warehouse system is much stronger at handling big data and easier to expand. The data mining function of the data warehouse can assist plant breeders in their work, providing scientific support while improving their efficiency.
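The seeding idea in the first research point can be illustrated with a minimal sketch. This is not the thesis's algorithm: it is plain Lloyd's K-means that starts from caller-supplied centers instead of random ones, followed by nearest-center labelling of a new sample. The names `kmeans` and `classify`, the toy 2-D `features`, and the hand-picked `seeds` are all assumptions for illustration; in the thesis the features and initial centers would come from the stacked sparse autoencoder, which is not reproduced here.

```python
import math

def kmeans(points, init_centers, max_iter=100):
    """Lloyd's K-means started from caller-supplied centers (no random seeding)."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels

def classify(sample, centers):
    """Label a newly added sample by its nearest learned center."""
    return min(range(len(centers)), key=lambda k: math.dist(sample, centers[k]))

# Hypothetical low-dimensional features (stand-ins for encoder output).
features = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
# Hypothetical seeds; the thesis derives its initial centers from learned features.
seeds = [(0.0, 0.0), (5.0, 5.0)]

centers, labels = kmeans(features, seeds)
new_label = classify((4.8, 5.2), centers)
```

Because the starting centers are fixed inputs, repeated runs on the same data give the same clustering, which is the property the abstract attributes to seeding K-means from learned features.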
Keywords/Search Tags: Germplasm resources, Stacked sparse autoencoder, Data clustering, Data warehouse, Spark, Hive