| As consumer channels shift from offline to online, the transaction data generated by e-commerce platforms in the online retail market have formed big data on commodity prices. Some scholars have used crawler technology to collect these data and study online price indexes. However, as the anti-crawler mechanisms of e-commerce platforms gradually improve, collecting these data by crawling becomes costly and may even disrupt continuous monitoring of the online retail market. It is therefore important to monitor the economic behavior of the online retail market at low cost and with high efficiency. To address the problems of using crawler technology to capture e-commerce big data for price index construction, this paper designs a sampling survey method. From a "big data-small data" perspective, the method builds a complete commodity URL list in the base period of the online retail commodity price index, fully crawls the commodity retail data of the e-commerce platform, and thereby determines the sampling frame of crawler data. Commodity text information is vectorized, and a commodity classification model based on the random forest algorithm is built to accurately identify the commodity categories in the sampling frame. In the continuous survey, following the idea of using auxiliary information for stratified sampling and given the high dimensionality of the crawler data sampling population, a multi-dimensional clustering method from machine learning is used to group highly similar commodities and thus stratify the sampling population; the silhouette coefficient is used to compare the similarity within each cluster and to determine the optimal number of strata. Because the sampling units in the crawler data sampling population differ greatly, unequal probability random sampling is adopted to select representative samples, and a progressive sampling strategy is used to determine the optimal total sample size, balancing cost and accuracy. In addition, the representative samples selected for the continuous survey of commodities in the online retail market may suffer from nonresponse; to prevent this from reducing the accuracy of the online retail commodity price index, the idea of formal and backup samples is proposed. For each formal sample, a nearest neighbor matching method based on an improved KD-tree algorithm is used to select a set of backup samples. An adaptive sample usage method is proposed to substitute for nonresponding formal samples, and the online retail commodity price index is then constructed. Taking grain and oil products on the Taobao and Jing Dong platforms as examples, the results show that the total volume of crawler sampling data for the two platforms in the base period is 18,346 and 25,497, respectively. The average sampling ratio of the continuous survey in periods 2-12 is 9.63% and 5.24%, respectively, and the average relative sampling error is 0.99% and 0.37%, respectively, indicating that the proposed method is effective. The method greatly reduces the cost of constructing the online retail commodity price index and provides a feasible scheme for online commodity operators and scholars to track commodity price fluctuations in the online retail market over the long term. |
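
The sampling and substitution steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: every function name is hypothetical, the size measure used for unequal probability sampling is an assumed auxiliary variable, a brute-force nearest neighbor search stands in for the improved KD-tree, and the index is an assumed unweighted geometric mean of price relatives, since the abstract does not state the index formula.

```python
import math
import random


def pps_sample(sizes, n, rng):
    """Draw n distinct unit indices, one at a time, with probability
    proportional to an assumed size measure (without replacement)."""
    pool = list(range(len(sizes)))
    chosen = []
    for _ in range(min(n, len(pool))):
        weights = [sizes[i] for i in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen


def nearest_backups(formal_idx, features, candidates, k):
    """Return the k candidates closest to the formal sample in feature space.
    Brute-force search, used here in place of a KD-tree."""
    ranked = sorted(
        candidates,
        key=lambda j: math.dist(features[formal_idx], features[j]),
    )
    return ranked[:k]


def index_with_substitution(formal, backups, base_price, current_price):
    """Geometric mean of price relatives (an assumed index formula).
    A formal sample with no current price (None) is replaced by its
    first backup that does have one; if none has, the sample is skipped."""
    logs = []
    for i in formal:
        for j in [i] + backups.get(i, []):
            if current_price.get(j) is not None:
                logs.append(math.log(current_price[j] / base_price[j]))
                break
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy data: five commodities described by two made-up features.
rng = random.Random(42)
sizes = [120.0, 80.0, 300.0, 40.0, 60.0]
features = [(1.0, 1.0), (1.1, 0.9), (5.0, 5.0), (5.2, 5.1), (0.9, 1.2)]
base_price = {i: 10.0 for i in range(5)}
current_price = {0: 10.5, 1: 10.4, 2: None, 3: 10.8, 4: 10.2}  # unit 2 is delisted

formal = pps_sample(sizes, 2, rng)
rest = [i for i in range(5) if i not in formal]
backups = {i: nearest_backups(i, features, rest, 2) for i in formal}
print(formal, backups, index_with_substitution(formal, backups, base_price, current_price))
```

Choosing backups by feature distance means a delisted commodity is replaced by a similar one, which is the role the abstract assigns to the backup samples.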