
Research on Data Cleaning of Website Based on Hadoop Architecture

Posted on: 2021-04-19    Degree: Master    Type: Thesis
Country: China    Candidate: H L Fan    Full Text: PDF
GTID: 2428330614455550    Subject: Computer technology
Abstract/Summary:
With the development of big data and artificial intelligence, the storage and mining of data have become increasingly important. At the same time, this growth has brought the problem of reduced data quality, and data cleaning is an effective way to improve it. This thesis takes as its cleaning object one month of website user click-log data collected before an e-commerce company launched a promotion, and applies appropriate cleaning methods to the large amount of duplicate data in it.

The duplicate log data are divided into two categories by type: completely duplicate data and similar duplicate data.

The first task is cleaning completely duplicate data. The Hadoop Distributed File System is used to store the log data; its replica management and heartbeat mechanism guarantee the efficiency and integrity of the data. MapReduce's parallel computing capability, combined with custom cleaning rules, counts the completely duplicate records and then eliminates them (see the first sketch below). The experiments show that this step removes all completely duplicate data.

The second task is cleaning similar duplicate data. The first step is to identify similar duplicates accurately and efficiently. For detection, a Levenshtein distance algorithm based on character frequency is proposed. By treating character frequency as additional data information, the algorithm makes character matching more informative (see the second sketch below). A comparison test between the standard Levenshtein distance algorithm and the character-frequency-based variant shows that the latter achieves a higher detection accuracy rate than the traditional edit distance algorithm.

Finally, the elimination of similar duplicate data is the most important and difficult part. A non-fixed-window Sorted-Neighborhood Method is proposed, whose window size is based on the similarity of the records and is computed by matching the records within the window (see the third sketch below). Three sets of comparison experiments were performed. The first compared the cleaning effect of the traditional Sorted-Neighborhood Method on the log data at different window sizes; a window size of 5 gave the best cleaning effect. The second compared the cleaning accuracy and cleaning time of the improved and traditional algorithms at window sizes of 5 and 7, and then tested the accuracy of the non-fixed-window Sorted-Neighborhood Method on the same data; it shows a small improvement over the traditional algorithm. The third modified the sort keywords and compared the results with the second experiment, confirming that the non-fixed-window Sorted-Neighborhood Method improves the accuracy rate.

Figures: 27; Tables: 13; References: 55.
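The following is a minimal sketch of the exact-duplicate step, assuming a plain MapReduce deduplication job: each raw log line becomes a map-output key, so identical lines meet at a single reducer, which writes the line once and counts the dropped copies. The class names and the counter are illustrative, and the thesis's custom cleaning rules (for example, field normalization before comparison) are not reproduced here.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExactDedup {
    // Map: emit each raw log line as the key; identical lines collide on one reducer key.
    public static class DedupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(line, ONE);
        }
    }

    // Reduce: write each distinct line once; a counter records how many copies were dropped.
    public static class DedupReducer extends Reducer<Text, IntWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            int copies = 0;
            for (IntWritable one : ones) copies += one.get();
            if (copies > 1) ctx.getCounter("dedup", "duplicatesRemoved").increment(copies - 1);
            ctx.write(line, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "exact-dedup");
        job.setJarByClass(ExactDedup.class);
        job.setMapperClass(DedupMapper.class);
        job.setReducerClass(DedupReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```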
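The abstract does not give the exact formulation of the character-frequency-based Levenshtein distance, so the second sketch is only one plausible reading: standard edit-distance similarity blended with a character-histogram overlap (Dice) score, so that two strings sharing the same characters in similar frequencies score higher. The blending weight alpha and the overlap formula are assumptions, not the thesis's definition.

```java
import java.util.HashMap;
import java.util.Map;

public class FreqLevenshtein {
    // Standard dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Character-frequency overlap: shared character counts over total length (Dice score).
    static double freqSimilarity(String a, String b) {
        Map<Character, Integer> fa = histogram(a), fb = histogram(b);
        int shared = 0;
        for (Map.Entry<Character, Integer> e : fa.entrySet())
            shared += Math.min(e.getValue(), fb.getOrDefault(e.getKey(), 0));
        return a.isEmpty() && b.isEmpty() ? 1.0 : 2.0 * shared / (a.length() + b.length());
    }

    static Map<Character, Integer> histogram(String s) {
        Map<Character, Integer> h = new HashMap<>();
        for (char c : s.toCharArray()) h.merge(c, 1, Integer::sum);
        return h;
    }

    // Blend edit-distance similarity with frequency overlap; alpha is an assumed tuning weight.
    static double similarity(String a, String b, double alpha) {
        int maxLen = Math.max(a.length(), b.length());
        double editSim = maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
        return alpha * editSim + (1 - alpha) * freqSimilarity(a, b);
    }
}
```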
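Similarly, the precise window-adjustment rule of the non-fixed-window Sorted-Neighborhood Method is not specified in the abstract. The third sketch shows one adaptive scheme consistent with the description: records are sorted on a key, and the window widens where matches cluster and narrows where they do not. The threshold, window bounds, alpha value, and sort key are all assumptions; the similarity call reuses FreqLevenshtein from the previous sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AdaptiveSNM {
    static final double MATCH_THRESHOLD = 0.85; // assumed similarity cutoff
    static final int MIN_WINDOW = 2, MAX_WINDOW = 10; // assumed window bounds

    // Detect similar-duplicate pairs with a window that grows where matches cluster
    // and shrinks where they do not.
    static List<int[]> detect(List<String> records) {
        records.sort(Comparator.comparing(AdaptiveSNM::sortKey));
        List<int[]> matches = new ArrayList<>();
        int window = MIN_WINDOW;
        for (int i = 0; i < records.size(); i++) {
            boolean matchedAny = false;
            for (int j = i + 1; j < Math.min(i + window, records.size()); j++) {
                // alpha = 0.7 is an assumed blending weight, not a value from the thesis.
                if (FreqLevenshtein.similarity(records.get(i), records.get(j), 0.7)
                        >= MATCH_THRESHOLD) {
                    matches.add(new int[] { i, j });
                    matchedAny = true;
                }
            }
            // Adapt the window to local similarity: widen in dense duplicate regions.
            window = matchedAny ? Math.min(window + 1, MAX_WINDOW)
                                : Math.max(window - 1, MIN_WINDOW);
        }
        return matches;
    }

    // Sort key: here simply the record itself; the thesis sorts on chosen keywords.
    static String sortKey(String record) {
        return record;
    }
}
```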
Keywords/Search Tags: data cleaning, Hadoop architecture, MapReduce, Levenshtein distance, Sorted-Neighborhood Method