The solar wind is a high-speed stream of charged particles, mainly protons and electrons, that flows from the outermost layer of the Sun's atmosphere into the surrounding space. It strongly shapes the Earth-Moon space environment. Solar storms in particular can cause great harm: they trigger geomagnetic and ionospheric storms and affect communications, power transmission, the safety of oil and gas pipeline systems, and so on. This paper studies the classification of the solar wind in order to better understand its characteristics, to better predict and guard against solar storms, and to reduce their effects on the Earth as far as possible.

Given the large volume of solar wind data, data mining is a natural tool for revealing the information hidden in it. Data mining is the process of extracting latent, previously unknown, and valuable information from large amounts of data. A data mining classification algorithm identifies the category of each data sample. Many classification algorithms have been proposed, including the Naive Bayesian model and the decision tree model, each with its own advantages and applications. C4.5 is a classic decision tree algorithm that improves on the ID3 algorithm. By selecting attributes with the information gain ratio, it overcomes the bias of information gain toward attributes with many values. It also applies pruning, can handle continuous attributes, and can process incomplete data, and the resulting classification rules are easy to understand and highly accurate. The Naive Bayesian classification algorithm rests on a solid mathematical foundation and has stable classification efficiency. It is based on Bayes' theorem: it computes posterior probabilities from prior probabilities and assigns each object to the class with the maximum a posteriori probability.
This idea is simple and intuitive. This paper focuses on these two algorithm models.

C4.5 has to scan and sort the data set sequentially many times while constructing a decision tree, which makes the algorithm inefficient. Both the Bayesian and the decision tree algorithms prove inadequate when dealing with large data sets. To improve efficiency, this paper parallelizes the algorithms on the Hadoop platform.

The Hadoop Distributed File System (HDFS) uses an efficient write-once, read-many access model. Data is accessed rapidly in streaming fashion, which ensures consistency when reading and writing. HDFS also offers good fault tolerance and a high transfer rate, can quickly detect and recover from hardware failures, and uses a simple and reliable communication protocol. In addition, MapReduce provides high-performance computing: data is distributed across the machines of the cluster and, where possible, stored on the compute nodes so that data locality allows quick access. Each MapReduce task is independent, and failed tasks can be detected, which gives high reliability. The MapReduce model can therefore be used to parallelize the algorithms on a Hadoop cluster and to handle large data sets easily. For these reasons, research on data mining classification algorithms based on cloud computing and solar wind data is of considerable significance.

This paper predicts the range of the proton speed from the proton density, the proton temperature, and the ratio of He4 to protons. Because big data processing with traditional data mining techniques and architectures demands extensive hardware resources and is extremely inefficient, we use cloud computing to implement the data mining techniques on the Hadoop platform.
We study the C4.5 decision tree and Naive Bayes algorithms, and apply an improved C4.5 algorithm and a new algorithm based on both (Bayesian-C4.5.1-Tree).

The work completed and the research contents of this paper are as follows.

First, we propose solutions to the poor performance of the traditional C4.5 decision tree algorithm. Using the Hadoop platform resolves the conflict between large data volumes and limited memory capacity. We improve the discretization of continuous attributes to avoid the excessive computation of the traditional method. The solar wind data are classified with the improved C4.5 algorithm, which yields a classification model.

Second, the Naive Bayes algorithm is applied to the classification of the solar wind data and implemented on the Hadoop platform, which yields the classification results.

Third, based on an analysis of the classification results that the improved C4.5 algorithm and the Naive Bayes algorithm produce on the solar wind data, we propose a new algorithm that combines the two methods, which we call the Naive Bayesian decision tree algorithm. We implement this algorithm on Hadoop to process the solar wind data and obtain the results.

Fourth, the performance of the three algorithms in classifying the solar wind data is compared and analyzed to determine the best algorithm.
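
As an illustration of the attribute selection criterion mentioned above, the following is a minimal Python sketch of the information gain ratio for a binary split of a continuous attribute. It shows the textbook definition only, not the improved discretization method of this paper, and it does not run on Hadoop. The function names, the toy density values, and the "fast"/"slow" labels are hypothetical and are not taken from the solar wind data set. Choosing the threshold directly by gain ratio is a simplification made for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` vs. `value > threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    # Information gain: entropy before the split minus weighted entropy after it.
    gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    # Split information penalizes splits into many small partitions.
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return gain / split_info


def best_threshold(values, labels):
    """Try the midpoints between consecutive distinct values and keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


# Made-up toy values (assumption): proton density and a speed-range label.
density = [2.0, 3.0, 4.0, 8.0, 9.0, 10.0]
speed_class = ["fast", "fast", "fast", "slow", "slow", "slow"]

threshold = best_threshold(density, speed_class)
print(threshold, gain_ratio(density, speed_class, threshold))
```

On this toy input the midpoint between 4.0 and 8.0 separates the two classes perfectly, so it receives the highest gain ratio. Sorting the distinct values and scoring every midpoint is the repeated scan-and-sort cost that motivates both the improved discretization and the parallel implementation described above.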