| Albacore tuna has become one of the main targets of the world tuna fishery because of its high economic value and abundant resources. Albacore fishing data from recent years show that Pacific albacore is concentrated in a few main areas, including the South Pacific (bounded by the equator), which is one of China's most important fishing grounds. Since 2001, China's tuna catch in the South Pacific has shown an increasing trend. In recent years, however, more and more countries and regions have joined the exploitation of fishery resources in the South Pacific. Research on albacore fishing-ground prediction can therefore effectively improve catches, and it is of great significance for fishery production and for strengthening China's standing in international fisheries organizations. With advances in science and technology, data on the marine environmental factors that affect fishery distribution have become easier to obtain, so the volume of data keeps growing. The analysis of such massive data is one of the research hotspots in fisheries science worldwide. Traditional linear models cannot accurately identify the key factors in complex and changeable marine environmental data, and the large data volume leads to large errors in their analysis and prediction. With continued progress in computer science, machine learning can make more efficient use of data in the era of big data and analyze complex, changeable, massive data in depth. To predict the distribution of albacore in the South Pacific more accurately, improve the precision of the forecast, and at the same time uncover the relationship between tuna catch and marine environmental factors, this project is based on South Pacific albacore data for the fifteen years from 2000 to 2015. Longline fishing production data and 
marine environmental factors such as sea surface height, chlorophyll-a concentration, and sea surface temperature, as well as spatio-temporal data such as month, longitude, and latitude, are combined and overlaid, and an ensemble learning model, LightGBM, is used to perform the fishery forecast. The main research contents of this paper are as follows: (1) In fishery forecasting, different environmental and spatio-temporal factors affect fish distribution to different degrees. This study uses feature_importances_ from machine learning methods to calculate the importance of each factor, that is, the size of its effect on fish distribution. Based on this importance index, factors of relatively low importance are optimized or replaced in the subsequent data preprocessing. This method is used to handle high-dimensional, complex marine environmental data, and the LightGBM model is then used to forecast from the optimized input factors. A Bayesian optimization algorithm based on Gaussian processes is used to tune the model's hyperparameters so that the LightGBM model performs at its best; the final model reaches a best prediction accuracy of 72.7%, a significant improvement over the other models compared. Finally, a prediction test was carried out using the data from 2015, and the results showed that the predicted fishing grounds were consistent with the actual fishing grounds. (2) To address the low processing efficiency and long running time associated with massive, complex fishery data, the parallel optimization of the LightGBM model based on Spark is studied. This study combines Spark's distributed in-memory computing with the LightGBM model's own parallel learning algorithm to improve computational efficiency on large-scale data sets. In the fishery forecasting problem, the LightGBM model running on a multi-node Spark cluster has a significantly improved 
speedup compared with a single node, which indicates that the algorithm in this study achieves higher parallel efficiency at large data scales. |
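
The multi-node versus single-node comparison can be made concrete with the standard definitions of speedup, S(n) = T(1) / T(n), and parallel efficiency, E(n) = S(n) / n. The following is a minimal sketch under stated assumptions: the function names and the timing values are hypothetical and purely illustrative, and they are not measurements reported in this work.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup S(n) = T(1) / T(n) for wall-clock times in the same unit."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_single / t_parallel


def parallel_efficiency(t_single: float, t_parallel: float, nodes: int) -> float:
    """Parallel efficiency E(n) = S(n) / n; 1.0 means ideal linear scaling."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_single, t_parallel) / nodes


# Hypothetical training times in seconds, keyed by node count.
# These are placeholder values for illustration, NOT results from the study.
timings = {1: 120.0, 2: 66.0, 4: 37.5}

for nodes, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = parallel_efficiency(timings[1], seconds, nodes)
    print(f"{nodes} node(s): speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency close to 1 indicates that adding nodes is paying off almost linearly, while a falling efficiency suggests that communication or scheduling overhead is starting to dominate.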