Research On WEB Data Mining

Posted on: 2007-01-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Wang
GTID: 1118360218457117
Subject: Computer software and theory
Abstract/Summary:
With the development of computer networks, the amount of online data is expanding rapidly, and the data stream model has appeared in a growing number of web information processing applications, including network security, network traffic monitoring, and sensor networks. In these settings, data arrives as unbounded, rapid, time-varying streams, and the data stream model suits the demands of web applications better than traditional persistent relations do. Hence, data stream processing has attracted considerable attention and has become a hot topic in web data mining.

Data stream classification is an important research field of data stream mining. Traditional classification implicitly assumes that the data comes from a database or data warehouse that is static or seldom updated. However, because of the characteristics of data streams, devising algorithms for mining them faces the following major challenges. First, the algorithm must use only a fixed amount of main memory, independent of the total number of records it has scanned. Second, as it may not have time to revisit old records, and the data may not even all be available in secondary storage at a future point in time, it must be able to build a model using at most one scan of the data. Third, learning algorithms must adapt quickly to concept drifting in data streams. Fourth, the algorithm must make a usable model available at any point in time.

This dissertation focuses on data stream classification and also gives some attention to web data mining applications. This research is partly supported by the National Natural Science Foundation (60373108) and the Specialized Research Fund for the Doctoral Program of Higher Education (2069901).

This dissertation studies the theory of data stream classification, with the following main research results:

1.
To deal with recurring contexts effectively, a model of data stream history and the RTRC system are proposed. Once it has scanned enough samples from the data stream, the system retains good classification accuracy even when the concept drifts. Using a Markov chain and the least-squares method, the system learns to predict what the next concept is and when the concept will drift. Extensive experiments on benchmark and synthetic datasets have been conducted. The experimental results confirm the advantages of this system over Weighted Bagging and CVFDT, two representative systems for mining data streams.

2. The two main challenges associated with mining data streams are concept drifting and data noise. A clustering-based method is presented to filter out hard instances and noisy instances from data streams. This dissertation also proposes a trigger to detect concept drifting and builds RobustBoosting, an ensemble classifier, by boosting the hard instances. The RobustBoosting and AdaptiveBoosting algorithms are compared on synthetic and real-life datasets. The experimental results show that the proposed method has a substantial advantage over AdaptiveBoosting in prediction accuracy, and that it can converge to target concepts efficiently with high accuracy on datasets with noise levels as high as 40%.

3. Many researchers have presented learning systems that assume the presence of hidden context and concept drifting. In particular, several systems have been proposed that use ensembles of classifiers on sequential chunks of instances. These systems can respond to gradual changes in data streams, but have problems responding to sudden changes.
To solve this problem, the reverse classifier is defined, which can help an algorithm learn from its errors, and this dissertation also presents the IWB (Improved Weighted Bagging) algorithm for classifying concept-drifting data streams using weighted ensemble classifiers. Considerable experimental work has been conducted to evaluate the IWB and WeightedBagging algorithms on the STAGGER Concepts dataset. The experimental results show that the proposed method is very successful.

4. State-of-the-art work on mining data streams concentrates on capturing time-evolving trends and patterns with labeled data. However, in most real-world problems, labeled data streams are rarely available immediately. To solve this problem, a method based on the cumulative sum (CUSUM) control chart is proposed that can detect changes in data streams without knowing the true class labels. Considerable experimental work has been performed to demonstrate the feasibility of the method.

This dissertation also studies web data mining applications, with the following results:

1. A web community is a collection of web pages created by individuals or any kind of association that share a common interest in a specific topic. A novel algorithm is presented that takes advantage of both the hyperlinks and the text in web pages to mine web communities by SVM classification. This dissertation reports experimental results on the WEBKB data collection, which contains 8282 web pages. The results demonstrate that the proposed approach can mine large, meaningful communities.

2. Researchers have proposed several sequential association-rule models for predicting the next HTTP request. These studies focus on using sequence and temporal constraints for pruning to improve prediction precision. A comparative study of different kinds of sequential association rules for web document prediction is provided.
First, this dissertation gives algorithms for mining sequential association rules based on different combinations of sequence and temporal constraints. Then, the performance of these algorithms is compared on a real web log dataset. Based on the comparison, the effect of sequence and temporal information on prediction precision is explored using analysis of variance (ANOVA). The results show that sequence constraints and temporal constraints each affect prediction precision, that the interaction between them also affects it, and that temporal constraints have a larger effect than sequence constraints.
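
For readers unfamiliar with the cumulative sum (CUSUM) control chart mentioned in result 4 of the data stream classification list, the following is a minimal textbook sketch of a two-sided CUSUM detector. It is not the dissertation's method: the abstract does not say which statistic is monitored, so the sketch assumes a single scalar value per record (for example, a classifier confidence score), and the names `cusum_detect`, `target`, `slack`, and `threshold` are hypothetical.

```python
from typing import Iterable, Optional


def cusum_detect(values: Iterable[float], target: float,
                 slack: float, threshold: float) -> Optional[int]:
    """Return the index of the first value at which a two-sided CUSUM
    chart signals a shift away from `target`, or None if it never does.

    Assumed parameters (not from the dissertation):
      target    - expected mean of the monitored statistic before a change
      slack     - deviation tolerated per observation before evidence accrues
      threshold - accumulated evidence needed to raise an alarm
    """
    upper = 0.0  # accumulated evidence of an upward shift
    lower = 0.0  # accumulated evidence of a downward shift
    for i, x in enumerate(values):
        # Each sum grows only when x deviates from target by more than slack,
        # and is clipped at zero so old in-control data does not mask a change.
        upper = max(0.0, upper + (x - target - slack))
        lower = max(0.0, lower + (target - x - slack))
        if upper > threshold or lower > threshold:
            return i
    return None


# Illustrative data only: six in-control values followed by an upward shift.
stable = [0.5, 0.52, 0.48, 0.51, 0.49, 0.50]
shifted = [0.9, 0.88, 0.92, 0.91]
alarm_index = cusum_detect(stable + shifted, target=0.5, slack=0.05, threshold=0.6)
print(alarm_index)
```

The detector reads each value once and keeps only two running sums, which is in line with the single-scan and fixed-memory requirements for stream algorithms described earlier in the abstract. It needs no class labels, only the monitored statistic.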
Keywords/Search Tags: Data Stream, Concept Drifting, Ensemble Classifier, Recurring Context, Web Community