With the rapid development of the Internet, social networks, e-commerce platforms, and online document platforms have grown explosively, and the number of images, text documents, and audio and video files in network data has increased exponentially. Traditional data access and retrieval methods cannot meet current needs, especially in application scenarios that demand low latency and high accuracy. Storing and retrieving massive data with cloud computing makes efficient use of hardware resources and avoids the drawbacks of traditional data storage methods. Among mainstream cloud computing platforms, Hadoop has become the preferred solution thanks to its complete ecosystem and fully open-source nature. Hadoop's core components are the MapReduce parallel computing model and the HDFS distributed file system. HDFS is designed to handle large files and large data volumes; when handling large numbers of small files, however, it suffers from a heavy memory burden on the NameNode and poor access performance.

This paper analyzes the business characteristics of an online document platform and the performance challenges HDFS encounters in massive-small-file scenarios, then designs and implements a massive-small-file access system based on Hadoop. To satisfy highly concurrent random read and write requests, the system adopts a two-level "local storage + HDFS" architecture, which meets the system's concurrent read and write requirements while providing linearly scalable mass storage capacity. The research comprises the following three parts:

1. Emphasis is placed on HDFS; the principles, advantages, and disadvantages of its native storage method are analyzed.

2. The principles of retrieval and the implementation of a full-text retrieval system are studied and discussed.

3. Combining the business characteristics of the online document platform with the architectural characteristics of HDFS, a distributed access system for massive small files based on HDFS is designed. It introduces functions for merging and storing small files and for pre-reading related documents, making full use of Hadoop's mass storage and high fault tolerance while avoiding Hadoop's inability to access large numbers of small files efficiently, thereby achieving high throughput and low latency in the target business scenarios. The system's design ideas and implementation are described in detail.
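The small-file merging described above can be illustrated with a minimal sketch. This is not the thesis's actual implementation; it only demonstrates, in plain Python with in-memory buffers, the general merge-and-index idea behind approaches like Hadoop's SequenceFile or HAR: many small files are packed into one large container object with an (offset, length) index, so a single large file replaces thousands of per-file metadata entries on the NameNode. The `SmallFileMerger` class and its method names are hypothetical.

```python
import io


class SmallFileMerger:
    """Toy sketch of the merge-and-index idea (hypothetical API).

    Many small files are appended into one container blob, and a
    name -> (offset, length) index allows each original file to be
    read back with a single seek. In a real HDFS deployment the blob
    would be one large HDFS file and the index would be persisted.
    """

    def __init__(self):
        self._blob = io.BytesIO()       # stand-in for one large HDFS file
        self._index = {}                # name -> (offset, length)

    def append(self, name: str, data: bytes) -> None:
        """Pack one small file's bytes onto the end of the container."""
        offset = self._blob.tell()
        self._blob.write(data)
        self._index[name] = (offset, len(data))

    def read(self, name: str) -> bytes:
        """Recover one small file by seeking into the container."""
        offset, length = self._index[name]
        self._blob.seek(offset)
        return self._blob.read(length)


# Two small "documents" become one container plus two index entries.
merger = SmallFileMerger()
merger.append("doc1.txt", b"hello")
merger.append("doc2.txt", b"world")
```

A pre-reading layer, as described in the abstract, could then fetch whole containers of related documents into local storage ahead of demand, so that subsequent reads are served at local-disk latency rather than per-file HDFS latency.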