Application Of Distributed Data Mining On Web Log Analysis

Posted on:2016-10-23

Degree:Master

Type:Thesis

Country:China

Candidate:X R Yao

Full Text:PDF

GTID:2298330467493200

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet, web technology has come into wide use. The number of Internet users is increasing and the volume of user-generated data is surging. These data often contain potential value and regularities, which can be extracted by applying data mining algorithms to web log data. From the mining results, site administrators can learn more about the needs of users and create value for the enterprise. However, traditional centralized data mining algorithms cannot handle large amounts of data well. This thesis therefore improves data mining algorithms with distributed technology and applies them to web log mining.

Firstly, in the web log preprocessing phase, the current session identification methods are analyzed and their defects are pointed out. The thesis then puts forward a personalized session identification algorithm, which splits sessions according to a threshold based on the user's behavior; that is, the threshold is set to a different value for different users. In addition, preprocessing is combined with the Hadoop platform to increase processing speed.

Secondly, because frequent pattern mining depends on the dynamic clustering result, the second part implements the DBDC clustering algorithm on the Hadoop platform. Considering the characteristics of web log data, a customized prefix-based similarity measure is used in the clustering phase. During partial (local) clustering, the way partial noise data is handled makes the clustering inaccurate, so this thesis improves that aspect. The improved DBDC, i.e. D-DBDC, also clusters the noise data to avoid losing clusters. In the partial adjustment stage, D-DBDC is revised to suit web log data mining.

Thirdly, for frequent pattern mining on web logs, the analysis shows that users often have multiple topics of interest, so the D-FP-Growth algorithm decides which method to use according to the number of clusters produced by dynamic clustering. D-FP-Growth contains two strategies. One is distributed FP-Growth based on horizontal division of the transaction database, which is used when there are more clusters. The other is based on vertical division of the transaction database, which is used when there are fewer clusters. D-FP-Growth also makes full use of the computing capability of each Hadoop node through its balancing policy.

Finally, the mining algorithms were tested on multiple data sets. The results demonstrate the effectiveness of the personalized session identification algorithm and indicate that the accuracy of the D-DBDC algorithm increases. The D-FP-Growth algorithm not only reduces repeated tasks among the distributed nodes but also balances the amount of computation across nodes, so the overall average running time of the algorithm is reduced.
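
The idea of a per-user session threshold can be illustrated with a minimal single-machine sketch. The abstract does not say how the thesis derives each user's threshold, so the rule used here (a multiple of the user's median gap between requests, with a 30-minute fallback) is purely an assumption, as are the names `user_threshold`, `split_sessions`, `personalized_sessions`, `factor` and `default`. The Hadoop side of the preprocessing is not shown.

```python
from statistics import median


def user_threshold(timestamps, factor=3.0, default=1800.0):
    """Hypothetical per-user timeout in seconds.

    Assumption (not from the thesis): factor x the user's median gap
    between consecutive requests, or `default` if there are no gaps.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return default
    return factor * median(gaps)


def split_sessions(timestamps, threshold):
    """Start a new session whenever the gap to the previous request
    exceeds `threshold`. `timestamps` must be sorted ascending."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


def personalized_sessions(log):
    """log: iterable of (user_id, timestamp_seconds) pairs.
    Returns {user_id: list of sessions}, each session a list of timestamps."""
    by_user = {}
    for user, ts in log:
        by_user.setdefault(user, []).append(ts)
    result = {}
    for user, stamps in by_user.items():
        stamps.sort()
        result[user] = split_sessions(stamps, user_threshold(stamps))
    return result


if __name__ == "__main__":
    log = [("a", 0), ("a", 10), ("a", 20), ("a", 500), ("a", 510),
           ("b", 0), ("b", 600), ("b", 1200), ("b", 9000)]
    for user, sessions in personalized_sessions(log).items():
        print(user, sessions)
```

In this toy log, a fixed 30-minute timeout would keep all of user `a`'s requests in one session, whereas the per-user threshold separates the burst at 0 to 20 seconds from the later one at 500 to 510 seconds.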

Keywords/Search Tags:

Web mining, Personalized session identification algorithm, D-DBDC, D-FP-Growth, Hadoop

Related items

1	Research On The Vertical FP-growth Mining Algorithm Based On Hadoop With Load Balancing
2	The Reach Of Personalized Recommendation Systems Based On The Web Log Mining
3	Improved Parallel Fp-Growth Algorithm Based On Hadoop
4	Research On Mining Personalized Information In Online Interaction And Learning Platform
5	The Research Of Personalized Users' Profile Based On Agent And Web Mining
6	Mining Association Rules Algorithm Analysis Based On Hadoop
7	The Research And Application Of Personalized Recommendation Based On Web Mining
8	Research On Association Rules Mining Methods Of Mass Engineering Data Based On Hadoop
9	Design And Implementation Of Analysis System Based On Web Log
10	Algorithm Research On Session-Based Recommendation System