Research On New Technology In Data Mining Field

Posted on: 2008-12-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Wang
GTID: 1118360245991013
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, online services such as electronic commerce, electronic government, and information retrieval are used more and more frequently, and demand for these services drives the Internet's growth. Faced with the huge volume of online data and the number of websites to choose from, however, users can easily feel lost. Making online services fit individual requirements has therefore become a crucial problem for service providers. Discovering users' traversal patterns is the key to this problem, because it bridges the gap between users and providers, and it is one of the goals of Web data mining.

Web data mining comprises Web usage mining, Web structure mining, and Web content mining. Frequent traversal sequence discovery is the main means of discovering users' traversal patterns, so it is one of the important tasks of Web usage mining. Web usage mining extracts knowledge from Web logs or from users' motivations and discovers relationships among the behaviors of different users. The mining results can be used to revise Web site design and to adapt the service mode to users' individual needs.

Based on a study of existing Web log mining algorithms, the database design features of OLTP and OLAP, and related knowledge, this dissertation proposes an improved AprioriAll algorithm. Because the Web log data is sliced by UserID during mining, users can be mined either as a whole or individually, so the mining results can be matched to individual requirements. The improved AprioriAll algorithm also supports incremental mining of the Web log, which makes dynamic Web log mining possible. Tests show that the improved algorithm reduces both the number of database scans and the size of the candidate set generated during mining.
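The idea of slicing the Web log by UserID, so that the same data can be mined for all users together or for one user alone, can be sketched as follows. This is a minimal illustration and not the dissertation's improved AprioriAll algorithm: the record layout `(user_id, session_id, page)`, the function names, and the brute-force counting of contiguous page sequences are all assumptions made for the example.

```python
from collections import defaultdict

def slice_by_user(log):
    """Group (user_id, session_id, page) records into per-user sessions.

    Record order is assumed to be chronological within each session.
    """
    sessions = defaultdict(lambda: defaultdict(list))
    for user_id, session_id, page in log:
        sessions[user_id][session_id].append(page)
    return {user: list(by_session.values()) for user, by_session in sessions.items()}

def frequent_sequences(sessions, min_support, max_len=3):
    """Count contiguous page sequences (brute force, for illustration only).

    The support of a sequence is the number of sessions that contain it.
    """
    counts = defaultdict(int)
    for pages in sessions:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(pages) - n + 1):
                seen.add(tuple(pages[i:i + n]))
        for seq in seen:  # each session contributes at most once per sequence
            counts[seq] += 1
    return {seq: c for seq, c in counts.items() if c >= min_support}

# Hypothetical log records
log = [
    ("u1", "s1", "A"), ("u1", "s1", "B"), ("u1", "s1", "C"),
    ("u1", "s2", "A"), ("u1", "s2", "B"),
    ("u2", "s3", "B"), ("u2", "s3", "C"),
    ("u2", "s4", "A"), ("u2", "s4", "C"),
]
by_user = slice_by_user(log)

# Mine all users as a whole ...
all_sessions = [s for user_sessions in by_user.values() for s in user_sessions]
global_patterns = frequent_sequences(all_sessions, min_support=2)

# ... or mine a single user's slice
u1_patterns = frequent_sequences(by_user["u1"], min_support=2)
```

Keeping the per-user slices separate means that new log records for one user only touch that user's slice, which is one plausible reading of how slicing helps incremental mining.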
Because the improved algorithm needs fewer database scans and fewer candidates, it saves both time and space.

Representing and storing Web sessions consumes a large amount of memory, and Apriori-like algorithms generate too many candidates and scan the database too many times during mining. To address these drawbacks, a technique for encoding Web transactions and an Inv-Apriori algorithm have been designed. The encoding technique denotes each Web session by a single number, which compresses the Web transaction database and reduces storage cost. The Inv-Apriori algorithm works in the reverse direction: it first discovers the maximal frequent traversal sequences and then derives the association rules, so many trivial steps can be omitted.

By analyzing how users behave when they surf the Internet and how websites respond to their requests, this dissertation gives a method for mining the Web log that uses the time a user spends on a Web page (the page duration time) as an indicator of the user's reaction to that page. Setting a duration time range before mining lets the analyst select and narrow the scope of mining, which improves the interaction between the algorithm and the analyst. Preprocessing first converts the Web log into a record set with a duration time field. A frequent traversal sequence tree with the duration time constraint is then built, and the occurrence count of each page is recorded in it; the tree both stores and compresses the record set. Finally, subject to the minimum support threshold, the algorithm searches the tree depth-first to find the maximal frequent traversal sequences that satisfy the duration time constraint. Compared with other algorithms, this method is highly efficient.

Fuzzy neural network (FNN) technology is one of the hot topics in data mining. Starting from the maximum likelihood principle, this dissertation presents cross-entropy theory with detailed formula derivations, together with a new activation function.
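The abstract does not give the new activation function, so the following worked comparison is only a sketch under an assumption: a logistic function with an adjustable slope $\lambda$ stands in for it, since that function also takes values between 0 and 1. The symbols $z_k$ (net input), $y_k$ (output), and $t_k$ (target) are introduced here for illustration.

```latex
% Assumed stand-in activation with adjustable slope \lambda > 0
y_k = f(z_k) = \frac{1}{1 + e^{-\lambda z_k}}, \qquad
f'(z_k) = \lambda\, y_k (1 - y_k)

% Sum-of-squared-errors criterion
E_{\mathrm{SSE}} = \frac{1}{2} \sum_k (y_k - t_k)^2, \qquad
\frac{\partial E_{\mathrm{SSE}}}{\partial z_k} = \lambda\, (y_k - t_k)\, y_k (1 - y_k)

% Cross-entropy criterion
E_{\mathrm{CE}} = -\sum_k \bigl[ t_k \ln y_k + (1 - t_k) \ln (1 - y_k) \bigr], \qquad
\frac{\partial E_{\mathrm{CE}}}{\partial z_k} = \lambda\, (y_k - t_k)
```

Under this assumption, the squared-error gradient carries the factor $y_k(1 - y_k)$, which vanishes when an output saturates near 0 or 1, whereas the cross-entropy gradient does not; in both cases the slope $\lambda$ scales the size of the update.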
Compared with the BP (error back-propagation) algorithm, which is based on the sum-of-squared-errors criterion and a sigmoid or hyperbolic tangent activation function, the classification algorithm based on cross-entropy theory and the new activation function speeds up learning without the risk of falling into a non-convergent state or becoming trapped in a local minimum. The new activation function takes values in the range 0 to 1, and its slope can be adjusted to tune the learning speed. It therefore improves the algorithm's dynamic performance and lets the FNN converge as quickly as possible, which improves the algorithm's efficiency.

To identify users in Web mining, the use of biometric recognition technology is proposed, together with a method for building an iris identification system based on the Hidden Markov Model. Because matching depends only on the orientation field of the iris, it is robust: it is less sensitive to noise and distortion in the iris image than conventional approaches, which need many iris details. Simplifying the preprocessing also makes the method more efficient. Reliably recognizing users overcomes a shortcoming of the current Web architecture, its statelessness. The Web log can then be sliced along the user dimension, so that users can be mined either as a whole or individually and the mining results can be matched to individual requirements. Once this idea is implemented, incremental and dynamic mining can also be realized.
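The duration-time preprocessing mentioned earlier in the abstract (turning the Web log into a record set with a duration time field, then restricting mining to a chosen duration range) might look like the sketch below. The abstract does not say how duration is measured; estimating it as the time until the same user's next request is an assumption, as are the record layout and the function names.

```python
from datetime import datetime

def add_durations(records):
    """Attach an estimated duration (seconds) to each page view.

    `records` is a list of (user_id, page, timestamp) tuples sorted by user
    and then by time. The last page of each user gets None, because no later
    request is available to estimate its duration from.
    """
    rows = []
    for current, following in zip(records, records[1:] + [None]):
        user, page, ts = current
        if following is not None and following[0] == user:
            duration = (following[2] - ts).total_seconds()
        else:
            duration = None
        rows.append((user, page, duration))
    return rows

def within_range(rows, low, high):
    """Keep only page views whose duration lies in [low, high] seconds."""
    return [r for r in rows if r[2] is not None and low <= r[2] <= high]

# Hypothetical log records
records = [
    ("u1", "A", datetime(2008, 1, 1, 10, 0, 0)),
    ("u1", "B", datetime(2008, 1, 1, 10, 0, 5)),
    ("u1", "C", datetime(2008, 1, 1, 10, 2, 5)),
    ("u2", "A", datetime(2008, 1, 1, 10, 0, 0)),
    ("u2", "D", datetime(2008, 1, 1, 10, 0, 40)),
]
rows = add_durations(records)
selected = within_range(rows, low=10, high=300)
```

Only the rows that survive the range filter would then be fed into the sequence tree, which is how a duration range narrows the scope of mining in this sketch.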
Keywords/Search Tags:User Dimension, Encoding Web Log Record, Inv-AprioriAll Algorithm, the Duration Time of Web Page, Cross Entropy Function, Activation Function, Biometric Recognition