With the rapid development of China’s railway industry, the number of railway vehicles and their operating volume are increasing day by day, and many new problems have emerged in their subsequent development. The main one is the lack of a unified information management platform for the full life cycle cost (LCC) of railway vehicles. Without such a platform, the stages of the life cycle cannot be considered together, and it is difficult to provide sufficient data support. At the same time, as the data scale keeps expanding, traditional relational databases run into performance bottlenecks and cannot store and process large-scale data effectively. In response to these problems, this thesis designs and implements a railway vehicle full life cycle cost data management system based on the Hadoop ecosystem. The system provides fully digital management of railway vehicle full life cycle cost and addresses the large-scale data storage and processing problems involved.

This thesis analyzes the life cycle cost and its data across the four stages of a railway vehicle's life: decision-making and design, procurement and implementation, operation and maintenance, and scrapping and recycling. It proposes a storage architecture in which Hadoop and a MySQL database work together. Building on the collaborative storage idea of engineer Bukhari Syeda Sana, the architecture offers a new solution to the data management demands of railway vehicle full life cycle cost, and the thesis conducts in-depth research on its key technologies.

On the one hand, to satisfy the data transfer demands of the system, this thesis studies existing data transfer algorithms and proposes a data transfer algorithm based on multiple nested transfer. The core of the algorithm is the handling of foreign keys in relational tables: the basic idea is to attach the referenced table to the referencing table as a column family of the corresponding HBase table. By transferring the main table, the directly related sub-tables, and the indirectly related sub-tables, the algorithm realizes automatic data transfer for the railway vehicle full life cycle cost data management system. The migrated data are complete, and the migration time is within an acceptable range.

On the other hand, to achieve data access across heterogeneous databases, a method of accessing the data in parallel through each database's own underlying query engine is first designed, which makes access to the heterogeneous databases transparent. For HBase, simple key-value queries are served through HBase’s ‘get’ and ‘scan’ interfaces, and complex queries are served through HiveQL statements on an integrated HBase/Hive architecture. Canal is then used to parse the binlog (the update log of the MySQL database) to synchronize data between the heterogeneous databases. A unified query interface for the heterogeneous databases is constructed through timestamp comparison to realize joint queries across them. The data access performance of the hybrid storage architecture is compared with that of a traditional database through experiments, and the results verify the effectiveness of the hybrid storage architecture.

To complete analysis tasks more efficiently, a multiple linear regression model based on MapReduce is studied and applied to estimating the repair cost of railway vehicles, and the effectiveness of the parallel regression model in big data analysis is verified through experiments. Finally, on the basis of the automatic data migration and heterogeneous database access technologies, the Hadoop-based railway vehicle LCC data management system is designed and implemented.
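
As an illustration of how a multiple linear regression can be parallelized in the MapReduce style mentioned above, the following minimal sketch assumes ordinary least squares solved through the normal equations. Each map task reduces its data partition to the partial sums X^T X and X^T y, the reduce step adds these partial sums, and the small final system is solved on a single node. This is a sketch under those assumptions and not necessarily the model used in the thesis: the function names, the two features (mileage and age), and the sample values are hypothetical, and plain Python stands in for an actual Hadoop job.

```python
from functools import reduce


def map_partition(rows):
    """Map step: turn one non-empty data partition into partial sums (X^T X, X^T y).

    Each row is (features, target); a constant 1.0 is prepended for the intercept.
    """
    dim = len(rows[0][0]) + 1
    xtx = [[0.0] * dim for _ in range(dim)]
    xty = [0.0] * dim
    for features, target in rows:
        x = [1.0] + list(features)
        for i in range(dim):
            xty[i] += x[i] * target
            for j in range(dim):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def reduce_partials(a, b):
    """Reduce step: element-wise sum of two partial results."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve(matrix, vector):
    """Solve matrix * beta = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (aug[i][n] - known) / aug[i][i]
    return beta


def fit(partitions):
    """Fit y = b0 + b1*x1 + ... by combining the per-partition sums."""
    xtx, xty = reduce(reduce_partials, map(map_partition, partitions))
    return solve(xtx, xty)


# Hypothetical records: (mileage, age) -> repair cost, generated from
# cost = 2 + 3*mileage + 0.5*age purely for illustration.
partitions = [
    [((1.0, 2.0), 6.0), ((2.0, 1.0), 8.5)],
    [((3.0, 4.0), 13.0), ((4.0, 3.0), 15.5), ((5.0, 6.0), 20.0)],
]
coefficients = fit(partitions)
```

Because the partial sums are additive, the fitted coefficients do not depend on how the records are split across partitions. With the illustrative data above they come out approximately equal to the generating values (intercept 2, mileage 3, age 0.5), up to floating-point error. Only the small matrices, whose size depends on the number of features and not on the number of records, have to be moved between nodes.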