Implementation Of A Distributed Hierarchical Clustering Algorithm For Huge Commodity Datasets

Posted on: 2018-12-26 | Degree: Master | Type: Thesis
Country: China | Candidate: J L Zhou | Full Text: PDF
GTID: 2348330515459754 | Subject: Computer application technology
Abstract/Summary:
With the progress of computer science and information technology, companies are able to collect and store large amounts of data. Data that is merely stored, however, takes up storage space without contributing to a company's value, so companies have begun working on mining information from it. In the past, this information was analyzed and interpreted by experts, which became increasingly difficult as data volumes and the number of attributes grew rapidly. How to discover knowledge effectively in huge databases, and process it further into indispensable business intelligence, has therefore gradually become the most important issue that twenty-first-century companies must face. In production practice, the contradiction between the speed at which data grows and the time consumed by data analysis has become increasingly prominent. Data mining is an analysis technology for large-scale data analysis and processing that addresses the limitations of traditional analytical methods. Through self-learning algorithms, data mining can obtain knowledge and information hidden in large-scale data sets.

As the main regulatory body for commodity imports and exports, the Customs is the producer and owner of massive import and export data. With the deepening and improvement of business information processing, the Customs has basically achieved fairly complete data-based regulation and digital operation capacity. The Customs is also one of the first government departments to apply data analysis. Through continuous accumulation and development, customs analysis work has gradually expanded from initial statistics to risk analysis, tax analysis and forecasting, intelligence analysis, and analysis in various business areas. The role of data analysis in decision-making, business monitoring, risk prevention, and problem detection is increasingly evident.

This paper takes the Customs Data Analysis Project as its main line. It proposes a data modeling and analysis system based on Hadoop and MapReduce, implements a series of processing modules for commodity data, and forms a distributed clustering system for commodity data. The main contents include preprocessing of commodity data, TF-IDF calculation, inverted index construction, similarity matrix calculation, and single-linkage hierarchical clustering. Finally, the results of hierarchical clustering are used to categorize the customs commodity data, which provides an accurate statistical basis for the statistical analysis and judgment module of the Customs and has proven effective in practical application.
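
The processing steps named in the abstract (TF-IDF calculation, inverted index construction, similarity calculation, single-linkage clustering) can be illustrated with a small single-machine sketch. This is not the thesis's Hadoop/MapReduce implementation, which the abstract does not detail. The function names, the whitespace tokenizer, the particular TF-IDF weighting, the use of cosine similarity, and the similarity threshold used to cut the hierarchy are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: relative term frequency times log(N / df).
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def inverted_index(vectors):
    """Map each term to a posting list of (doc_id, weight), in doc_id order."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def similarities(index):
    """Cosine similarity for every document pair sharing at least one term."""
    sims = defaultdict(float)
    for postings in index.values():
        for (i, wi), (j, wj) in combinations(postings, 2):
            sims[(i, j)] += wi * wj  # vectors are unit length, so this sums to cosine
    return sims


def single_linkage(n, sims, threshold):
    """Merge the most similar pairs first (Kruskal style) down to `threshold`.

    Returns the clusters at that cut and the merge sequence, which is the
    single-linkage dendrogram restricted to similarities >= threshold.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for (i, j), s in sorted(sims.items(), key=lambda kv: -kv[1]):
        if s < threshold:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            merges.append((i, j, s))
    clusters = defaultdict(list)
    for doc_id in range(n):
        clusters[find(doc_id)].append(doc_id)
    return sorted(clusters.values()), merges


def cluster_commodities(texts, threshold):
    """Hypothetical end-to-end helper: texts -> (clusters, merges)."""
    docs = [text.lower().split() for text in texts]
    sims = similarities(inverted_index(tfidf_vectors(docs)))
    return single_linkage(len(docs), sims, threshold)


if __name__ == "__main__":
    sample = [
        "cotton t-shirt men",
        "men cotton shirt",
        "stainless steel pipe",
        "steel pipe seamless",
        "frozen shrimp",
    ]
    clusters, merges = cluster_commodities(sample, threshold=0.3)
    print(clusters)
    print(merges)
```

Going through the inverted index means similarity is only computed for document pairs that share a term, which keeps the similarity matrix sparse. In a MapReduce setting each posting list could plausibly be handled by one reduce call, but that mapping is a guess and is not taken from the thesis.
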
Keywords/Search Tags:Data mining, Text clustering, Hadoop, MapReduce