
Application Of Distributed Data Mining On Web Log Analysis

Posted on: 2016-10-23  Degree: Master  Type: Thesis
Country: China  Candidate: X R Yao  Full Text: PDF
GTID: 2298330467493200  Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, web technology has come into wide use. The number of Internet users is growing and the volume of user-generated data is surging. These data often contain potentially valuable information and patterns, which can be extracted by applying data mining algorithms to web log data. From the mining results, site administrators can learn more about users' needs and create value for the enterprise. However, traditional centralized data mining algorithms cannot handle large volumes of data well. This thesis therefore aims to improve data mining algorithms with distributed technology and apply them to web log mining. First, for the web log preprocessing phase, the thesis analyzes current session identification methods and identifies their defects. It then proposes a personalized session identification algorithm that splits sessions using a threshold derived from each user's behavior; that is, the threshold is set to a different value for different users. In addition, preprocessing is combined with the Hadoop platform to increase processing speed. Second, because frequent pattern mining depends on dynamic clustering decisions, the second part implements the DBDC clustering algorithm on the Hadoop platform. Considering the characteristics of web log data, a customized prefix-based similarity measure is used in the clustering phase. During partial (local) clustering, the way partial noise data is handled makes the clustering inaccurate, so the thesis improves this aspect. The improved DBDC, i.e. D-DBDC, clusters the noise data to avoid losing clusters.
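The personalized session identification idea described earlier in this abstract (a different timeout threshold for each user) can be illustrated with a minimal sketch. The abstract does not say how the per-user threshold is computed, so the rule used here (a multiple of the user's median gap between requests) is purely an assumption, as are the names `user_threshold`, `split_sessions`, `factor`, and `default`. This is not the thesis's algorithm and does not involve Hadoop.

```python
from collections import defaultdict
from statistics import median


def user_threshold(timestamps, factor=3.0, default=1800.0):
    """Hypothetical per-user timeout in seconds.

    Assumption: the timeout is `factor` times the user's median gap between
    consecutive requests; `default` is used when there is only one request.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return factor * median(gaps) if gaps else default


def split_sessions(records):
    """Group (user_id, unix_time, url) records into per-user sessions.

    A new session starts whenever the gap to the previous request of the
    same user exceeds that user's own threshold.
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        threshold = user_threshold([ts for ts, _ in hits])
        current = [hits[0][1]]
        user_sessions = [current]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > threshold:
                current = [url]  # gap too long: open a new session
                user_sessions.append(current)
            else:
                current.append(url)
        sessions[user] = user_sessions
    return sessions


if __name__ == "__main__":
    log = [
        ("fast", 0, "/a"), ("fast", 10, "/b"), ("fast", 20, "/c"),
        ("fast", 500, "/d"), ("fast", 510, "/e"),
        ("slow", 0, "/x"), ("slow", 400, "/y"), ("slow", 800, "/z"),
        ("slow", 5000, "/w"), ("slow", 5400, "/v"),
    ]
    for user, user_sessions in split_sessions(log).items():
        print(user, user_sessions)
```

In this toy log, no single fixed timeout gives the same split for both users: one short enough to separate the quick user's two visits would also break the slow user's browsing apart at every request, which is the kind of defect a per-user threshold is meant to avoid.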
In the partial (local) adjustment stage, D-DBDC is revised to suit web log data mining. Third, for frequent pattern mining in web logs, the analysis shows that users often have multiple topics of interest. The D-FP-Growth algorithm therefore decides which method to use according to the number of clusters produced by dynamic clustering. D-FP-Growth contains two strategies. One is distributed FP-Growth based on horizontal division of the transaction database, used when the number of clusters is large. The other is based on vertical division of the transaction database, used when the number of clusters is small. D-FP-Growth also uses a load-balancing policy to take full advantage of the computing capability of each Hadoop node. Finally, the mining algorithms were tested on multiple data sets to demonstrate the effectiveness of the personalized session identification algorithm. The results indicate that the accuracy of the D-DBDC algorithm increases, and that D-FP-Growth not only reduces repeated tasks among the distributed nodes but also balances the amount of computation across nodes. As a result, the overall average running time of the algorithm is reduced.
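The two ways of dividing a transaction database mentioned above can be illustrated with a small sketch. It assumes that "horizontal division" means splitting the transactions (rows) across nodes and that "vertical division" means an item-oriented layout mapping each item to the ids of the transactions containing it; the thesis may define the vertical strategy differently. The function names and the `cutoff` value in `choose_strategy` are hypothetical. The sketch only counts item supports; it does not build FP-trees or run on Hadoop.

```python
from collections import Counter, defaultdict


def horizontal_partition(transactions, n_nodes):
    """Split the transactions (rows) round-robin across n_nodes parts."""
    parts = [[] for _ in range(n_nodes)]
    for i, t in enumerate(transactions):
        parts[i % n_nodes].append(t)
    return parts


def horizontal_counts(parts):
    """Count item supports locally in each part, then merge the counts."""
    total = Counter()
    for part in parts:  # each part stands in for the work of one node
        local = Counter(item for t in part for item in set(t))
        total.update(local)
    return dict(total)


def vertical_partition(transactions):
    """Map each item to the set of transaction ids that contain it."""
    tidlists = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists[item].add(tid)
    return dict(tidlists)


def vertical_counts(tidlists):
    """An item's support is the size of its transaction-id set."""
    return {item: len(tids) for item, tids in tidlists.items()}


def choose_strategy(n_clusters, cutoff=4):
    """Hypothetical rule: many clusters -> horizontal, few -> vertical."""
    return "horizontal" if n_clusters >= cutoff else "vertical"


if __name__ == "__main__":
    db = [
        ["/home", "/news"],
        ["/home", "/shop"],
        ["/home", "/news", "/shop"],
        ["/news"],
    ]
    print(horizontal_counts(horizontal_partition(db, 2)))
    print(vertical_counts(vertical_partition(db)))
    print(choose_strategy(6), choose_strategy(2))
```

Both layouts yield the same item supports; they differ in how the work is distributed, which is why a rule that selects between them could depend on the number of clusters.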
Keywords/Search Tags: Web mining, Personalized session identification algorithm, D-DBDC, D-FP-Growth, Hadoop