
Research on an SNM-Based Cleaning Method for Large Volumes of Chinese Commodity Data

Posted on: 2019-04-27  Degree: Master  Type: Thesis
Country: China  Candidate: M M Zhang  Full Text: PDF
GTID: 2428330566974092  Subject: Engineering
Abstract/Summary:
With the rapid growth of the economy, the amount of data has increased rapidly, and more and more data processing technologies have emerged, such as data collection and storage. However, when corporate decision-makers want to use this massive data to support business decisions, data quality problems often stand in the way: decision-makers cannot extract useful information from large amounts of data quickly enough to inform important decisions. Data quality problems therefore not only hinder the integration of multiple data sources, but also leave decision-makers without correctly formatted data at the time and place they need it.

Data in a data warehouse comes from various business data sources. These sources may be stored on different hardware platforms and run under different operating systems, so data quality problems are inevitable. The main ones are: (1) duplicate records; (2) abnormal records. The goal of data cleaning is to sort out and standardize the data in the data warehouse, eliminate ambiguity, and improve data quality. Data quality has become the main difficulty in data integration and similar projects, and many projects fail to achieve their goals because of it. Data cleaning technology has therefore received wide attention from domestic experts and scholars, who have improved existing cleaning methods or proposed new ones for Chinese data.

To address these problems, this thesis surveys relevant domestic and international material and builds on the theory and practical application of the traditional Sorted-Neighborhood Method (SNM algorithm). Its main contents are as follows:

(1) It describes the relevant background on data processing in detail, gives a brief overview of the research status at home and abroad, and introduces related concepts such as data cleaning, data quality, and duplicate records.

(2) It introduces the theory of the traditional SNM algorithm, discusses the defects of the algorithm, and improves on them. The data set is preprocessed with the complement and segmentation method.

(3) Chinese word segmentation is performed on the product information in the data set. The thesis first briefly introduces commonly used Chinese word segmentation methods, then conducts experiments on some test data and analyzes the results. With Chinese word segmentation incorporated, the improved SNM algorithm shows clearly better execution efficiency.

(4) The improved algorithm is applied to a practical problem: the improved SNM algorithm is used to clean an operator's data set of 50,000 commodity records. The experimental results show that, in the same computing environment, the improved SNM algorithm significantly reduces the execution time of Chinese data cleaning and has a clear advantage in eliminating similar duplicate records.
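For readers unfamiliar with the Sorted-Neighborhood Method named above, the following is a minimal sketch of the traditional approach only: sort the records by a key, slide a fixed-size window over the sorted order, and compare each record only with its neighbors inside the window. The function name `sorted_neighborhood`, the parameters `window` and `threshold`, the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. The abstract does not specify the thesis's sort key, similarity measure, or its complement and segmentation preprocessing, so none of those are reproduced here.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Return sorted index pairs (i, j), i < j, of likely duplicate records.

    Sketch of the traditional SNM idea; `key`, `window`, and `threshold`
    are illustrative assumptions, not values taken from the thesis.
    """
    # Step 1: sort record indices by the chosen key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    # Step 2: slide a window over the sorted order.
    for pos, i in enumerate(order):
        # Compare only with the previous (window - 1) records.
        for j in order[max(0, pos - window + 1):pos]:
            # Stand-in similarity measure (assumption): difflib ratio in [0, 1].
            similarity = SequenceMatcher(None, records[i], records[j]).ratio()
            if similarity >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical commodity names, invented for this example.
    sample = [
        "华为 Mate 20 手机 黑色",
        "苹果 iPhone 8 手机",
        "华为 Mate 20 手机 黑色",
        "小米 电视 55 英寸",
    ]
    print(sorted_neighborhood(sample, key=lambda r: r))
```

The window limits the number of comparisons to roughly the number of records times the window size, instead of comparing every pair. The trade-off is that duplicates whose keys sort far apart are missed, which is one commonly cited weakness of the traditional method.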
Keywords/Search Tags: data cleaning, SNM algorithm, duplicate records, data quality