Implementation Of A Distributed Hierarchical Clustering Algorithm For Huge Commodity Datasets

Posted on:2018-12-26

Degree:Master

Type:Thesis

Country:China

Candidate:J L Zhou

Full Text:PDF

GTID:2348330515459754

Subject:Computer application technology

Abstract/Summary:

With the progress of computer science and information technology, companies are able to collect and store large amounts of data. Data that is merely stored, however, only takes up storage space and adds little to a company's value, so companies have begun to mine information from it. In the past, experts analyzed and interpreted the data themselves, which became more and more difficult as data volumes and the number of attributes grew rapidly. How to discover knowledge effectively in huge databases, and process it further into indispensable business intelligence, has therefore gradually become the most important issue that twenty-first-century companies must face. In production practice, the conflict between the speed at which data grows and the time that data analysis consumes has become more and more prominent. Data mining is an analysis technology for large-scale data analysis and processing that addresses the limitations of traditional analytical methods. Using self-learning algorithms, data mining can extract knowledge and information hidden in large-scale data sets.

Customs, as the main regulator of commodity imports and exports, produces and owns massive amounts of import and export data. As business information processing has deepened and improved, customs has largely achieved data-based regulation and digital operation. Customs is also one of the first government departments to apply data analysis. Through continuous accumulation and development, customs analysis work has gradually expanded from initial statistics to risk analysis, tax analysis and forecasting, intelligence analysis, and analysis in various business areas. The role of data analysis in decision-making, business monitoring, risk prevention, and problem detection is increasingly evident.

Taking the Customs Data Analysis Project as its main thread, this thesis proposes a data modeling and analysis system based on Hadoop and MapReduce, implements a series of processing modules for commodity data, and builds a distributed clustering system for commodity data. The main contents include preprocessing of commodity data, TF-IDF calculation, inverted index construction, similarity matrix calculation, and single-linkage hierarchical clustering. Finally, the results of hierarchical clustering are used to categorize customs commodity data, which provides an accurate statistical basis for the statistical analysis and judgment module of customs and has proved effective in practical application.
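
The processing stages listed in the abstract can be illustrated on a single machine. The following is a minimal sketch, not the thesis's Hadoop/MapReduce implementation: the function names, the particular TF-IDF weighting (term frequency times the natural log of N/df), the use of cosine similarity, and the similarity threshold used to cut the hierarchy are all assumptions made for illustration, and the input is assumed to be already tokenized commodity descriptions. Single-linkage clustering cut at a threshold is computed here as the connected components of the graph whose edges are pairs with similarity at or above the threshold.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def inverted_index(vectors):
    """Map each term to the (doc_id, weight) postings with non-zero weight."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, w in vec.items():
            if w > 0:
                index[term].append((doc_id, w))
    return index


def cosine_similarities(vectors, index):
    """Sparse similarity matrix: {(a, b): cosine} for pairs sharing a term, a < b."""
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
    dots = defaultdict(float)
    # Only documents that share a posting list can have a non-zero dot product.
    for postings in index.values():
        for i, (a, wa) in enumerate(postings):
            for b, wb in postings[i + 1:]:
                dots[(a, b)] += wa * wb
    return {(a, b): d / (norms[a] * norms[b]) for (a, b), d in dots.items()}


def single_linkage(n, sims, threshold):
    """Merge the most similar pairs first; stop below the (assumed) threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), s in sorted(sims.items(), key=lambda kv: -kv[1]):
        if s < threshold:
            break
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


# Hypothetical, already-tokenized commodity descriptions.
docs = [
    ["cotton", "shirt", "men"],
    ["cotton", "shirt", "women"],
    ["steel", "pipe", "seamless"],
    ["steel", "pipe", "welded"],
]
vectors = tf_idf(docs)
index = inverted_index(vectors)
sims = cosine_similarities(vectors, index)
clusters = single_linkage(len(docs), sims, threshold=0.3)
print(clusters)  # [[0, 1], [2, 3]]
```

In a MapReduce setting each of these functions would plausibly become one or more jobs (for example, the inverted index keyed by term, and the pairwise products emitted per posting list and summed in a reducer), but how the thesis actually partitions the work is not stated in the abstract.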

Keywords/Search Tags:

Data mining, Text clustering, Hadoop, MapReduce

Related items

1	Design And Implementation Of Clustering Algorithm For Large Scale Chinese Short Text Based On MapReduce
2	The Research Of Clustering Mining Based On Logistics History Data On The Hadoop
3	Implementation Of A Distributed Hierarchical Clustering Algorithm For Huge Commodity Datasets
4	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine
5	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
6	Research And Application Of Text Mining Based On Hadoop
7	Research And Implementation Of Text Clustering Based On AP Algorithm
8	Research Of Clustering Mining Algorithm Oriented Big Data
9	Research And Application Of Data Mining Algorithms Using MapReduce
10	Research, Design And Application Of Clustering Algorithm Using MapReduce