| As agricultural science and technology in China become increasingly information-driven, the agricultural production model has gradually evolved from traditional manual labor to technology- and information-based agriculture. In particular, with the spread of Internet of Things (IoT) technology, many IoT-based applications have already been deployed in agriculture. In these applications, sensors generate data continuously, and huge volumes of data accumulate over time. The monitoring of rural drinking water is no exception. Such a large body of rural drinking water data is highly valuable. However, agricultural data in China's rural areas are currently underused, and the same is true of online drinking water monitoring data: most real-time monitoring data are simply discarded when storage space runs out, so their potential value is never fully tapped. In addition, the storage tools currently used in rural drinking water monitoring projects are mainly relational databases. When handling large amounts of data, such databases offer poor throughput, and because of their limited scalability they have obvious deficiencies in managing dispersed data. This paper takes rural drinking water data as its experimental object, addresses the problems of relational databases in traditional rural drinking water monitoring projects with respect to storage capacity, throughput, disaster recovery, and data reuse, and studies a distributed storage and analysis platform. A Hadoop cluster serves as the underlying storage architecture, while the relational database MySQL stores the real-time attribute data of drinking water; historical data in MySQL are migrated to the Hadoop cluster at fixed intervals. This compensates for Hadoop's poor support for low-latency operations during data visualization and, at the same time, relieves the storage capacity and throughput limitations of MySQL. Data cleaning is performed in a Kafka cluster: the raw-data consumer extracts the original data to HDFS, and the cleaning consumer cleans data with missing values and stores the result in the Hive data warehouse. Periodic statistics from the data warehouse and real-time data from MySQL are sent to the web front end for visualization. Based on the characteristics of drinking water data, and to remedy the deficiencies of traditional distributed platforms in file storage and management, this work proposes an attribute-based file merging storage strategy and a dynamic file replica management strategy based on file access. Systematic experiments show that the rural drinking water mass data storage and analysis platform can store, protect through disaster recovery, and reuse the drinking water data studied in this project, and that its read and write performance is clearly better than that of a traditional mass data storage and analysis platform. |
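
The two file-management strategies named in the abstract can be illustrated with a small sketch. The abstract gives no algorithmic detail, so everything below is an assumption made for illustration: the function names `plan_merges` and `replica_count`, the file-record fields (`name`, `size`, and the grouping attribute), the size limit, and the access thresholds are all hypothetical and are not taken from the paper. The sketch only shows the general idea of packing small files that share an attribute into larger merge batches, and of choosing a replica count from how often a file is read.

```python
from collections import defaultdict


def plan_merges(files, key_attr, max_bytes):
    """Group small files by one attribute and pack each group into
    merge batches whose total size stays within max_bytes.

    `files` is a list of dicts with hypothetical fields "name", "size",
    and the attribute named by `key_attr` (for example a monitoring site).
    Returns a list of (attribute_value, [file names]) batches.
    """
    groups = defaultdict(list)
    for f in files:
        groups[f[key_attr]].append(f)

    batches = []
    for key in sorted(groups):
        current, size = [], 0
        for f in sorted(groups[key], key=lambda item: item["name"]):
            # Close the current batch when adding this file would exceed the limit.
            if current and size + f["size"] > max_bytes:
                batches.append((key, current))
                current, size = [], 0
            current.append(f["name"])
            size += f["size"]
        if current:
            batches.append((key, current))
    return batches


def replica_count(reads, cold_below=1, hot_from=100, low=2, default=3, high=5):
    """Pick a replica count from a file's recent read count.

    The thresholds and replica numbers are illustrative assumptions only.
    """
    if reads >= hot_from:
        return high      # frequently read file: keep more copies
    if reads < cold_below:
        return low       # unread file: keep fewer copies
    return default
```

For example, with a 100-byte limit, three 40-byte files from one site would be planned as one batch of two files and one batch of one file, while a file from another site stays in its own batch. A real deployment would size batches relative to the HDFS block size and would apply the chosen replica count through the cluster's own replication settings.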