
Design and Implementation of a Distributed System for Dynamic Detection and Rechecking of E-commerce Site Data

Posted on: 2017-03-19  Degree: Master  Type: Thesis
Country: China  Candidate: C Z Lu  Full Text: PDF
GTID: 2348330518996596  Subject: Electronic Science and Technology
Abstract/Summary:
With the vigorous growth of e-commerce, e-commerce sites have become increasingly popular and the volume of e-commerce data has grown explosively. As online shopping has become part of daily life, data from e-commerce sites has become an important object of study for researchers examining people's everyday economic activity, so it is important to collect information from these sites efficiently. E-commerce sites, however, contain not only a large amount of data but also a great deal of redundant data, which seriously reduces both the efficiency of data collection and the accuracy of the collected data. To support dynamic fetching of e-commerce data, detection must therefore be performed dynamically during the fetching process. Many data detection algorithms exist, but they are general-purpose and do not take full advantage of the characteristics of e-commerce sites. This thesis first studies and summarizes the characteristics of the major mainstream domestic e-commerce sites. Based on these characteristics, it proposes a Bloom filter based on site characteristics and a URL-based fingerprint algorithm for web pages, and then uses the new algorithms to design and implement a distributed rechecking system for e-commerce site data.

(1) The Bloom filter algorithm based on site characteristics. Motivated by the efficiency requirements of analyzing e-commerce pages in real time, this part analyzes the principle of the traditional Bloom filter and points out a defect: when checking URLs for duplicates, it ignores redundant information within the URL. An improved method is proposed: a Bloom filter based on extracting site features from URLs. The method first defines the site features, then quantifies and extracts them with the improved algorithm, then trains URL filtering rules according to the characteristics of the site, and finally removes redundant information according to those rules. An experiment on more than 2 million records found that the improved Bloom filter is considerably more time-efficient, and that the improvement becomes more pronounced as the amount of data grows, which shows that the proposed method is effective and meets the application requirements.

(2) The URL-based fingerprint algorithm for web pages. Analysis of e-commerce sites shows that when more than one network address corresponds to the same page, the URLs are highly similar. At the same time, traditional rechecking algorithms must download a web page before it can be checked, so they cannot improve page collection efficiency. Based on these two observations, this thesis proposes a URL-based fingerprint algorithm for web pages: it extracts and quantifies URL attributes, extracts fingerprints to train a site fingerprint, and then judges whether pages are duplicates by comparing similarity. In an experiment on 2.2 million records, the URL-based fingerprint algorithm increased time efficiency by 11% while keeping the margin of error small (1%), and the effect became more pronounced as the amount of data grew.

(3) Design and implementation of a topic-based distributed rechecking system. After analyzing the principle and defects of the traditional Bloom filter, a topic-based distributed rechecking system is designed. To ensure rechecking efficiency, reliability, and maintainability in a distributed setting, this part uses the rechecking methods studied in Chapters 3 and 4 and implements the system with the ZooKeeper and Thrift frameworks. Finally, a comparison of the new system with a traditional distributed rechecking system shows that the new system has good maintainability and reliability and is more efficient.
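As an illustration of the idea in part (1), the following minimal Python sketch removes redundant URL information before a Bloom filter membership check. It is not the thesis's algorithm: the thesis trains filtering rules per site, whereas this sketch assumes a fixed, hypothetical whitelist of query parameters (`KEEP_PARAMS`). The names `canonicalize`, `BloomFilter`, and `seen_before`, along with the filter sizes, are assumptions chosen for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule: only these query parameters identify a page.
# In the thesis such rules are trained per site; here they are hard-coded.
KEEP_PARAMS = {"id"}


def canonicalize(url, keep=KEEP_PARAMS):
    """Drop query parameters outside `keep`, drop the fragment, lowercase the host."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k in keep)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
    )


class BloomFilter:
    """Plain Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def seen_before(bloom, url):
    """Return True if the canonical form of `url` was probably seen; record it otherwise."""
    key = canonicalize(url)
    if key in bloom:
        return True
    bloom.add(key)
    return False
```

With this sketch, `http://Shop.example.com/item?id=42&ref=home` and `http://shop.example.com/item?id=42&ref=search#top` reduce to the same key, so the second is reported as a duplicate without downloading the page. As with any Bloom filter, a "seen" answer can be a false positive, while a "not seen" answer is always correct.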
Keywords/Search Tags:e-commerce site data, dynamic detection, distributed rechecking system