
Research On Cluster Tree Method For Textual Stream Classification

Posted on: 2014-06-21 | Degree: Master | Type: Thesis
Country: China | Candidate: S F Bie | Full Text: PDF
GTID: 2298330422990423 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and popularity of internet technology, large amounts of continuous textual stream data are now generated all the time, e.g. hourly news, e-mails, chat records and so on. In academia and industry, increasing attention has been paid to how to use this kind of data. Within textual data stream learning, classification is one of the most important fields, with applications in spam email filtering, terrorism investigation and other areas. This thesis studies the problem of textual stream categorization in depth. The completed work is described as follows:
(1) After analyzing the types of textual stream generating processes, the properties of this kind of data and the current challenges, we summarize methods for textual stream classification and classification methods based on clustering techniques, both in China and abroad.
(2) Based on previous work, a cluster tree method with label information (called CTL) is proposed. In the training stage, the CTL algorithm uses the label information more rationally by considering attribute similarity and label similarity simultaneously, which yields a more reasonable cluster tree. Furthermore, CTL applies a new clustering algorithm that updates the cluster centroids based on their importance. In the experiments on high-dimensional data, the CTL algorithm shows advantages over the Cluster Tree algorithm, other tree classifiers (C4.5, CART and Random Forest) and the SVM algorithm.
(3) For the problem of textual stream categorization, a dynamic ensemble classification algorithm with CTL is proposed. This algorithm uses CTL as the base learner, treats the most recent data block as the validation set for computing accuracy, and then applies two different weighting schemes. In the experiments, compared with different combinations of four ensemble methods and three base learners, the algorithm proves effective on different datasets.
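The block-based weighting idea in item (3) can be illustrated with a minimal sketch. The abstract does not specify the two weighting schemes or the internals of CTL, so everything below is an assumption made for illustration: `MajorityLearner` is a hypothetical stand-in for the base learner (it is not CTL), `BlockEnsemble` and `max_members` are invented names, and a member's weight is simply taken to be its accuracy on the newest labeled block.

```python
from collections import Counter, defaultdict


class MajorityLearner:
    """Hypothetical stand-in base learner (NOT the thesis's CTL).

    Each known token votes for the label it co-occurred with most often;
    documents with no known token get the training block's majority label.
    """

    def fit(self, docs, labels):
        self.default = Counter(labels).most_common(1)[0][0]
        votes = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            for token in doc.split():
                votes[token][label] += 1
        self.token_label = {t: c.most_common(1)[0][0] for t, c in votes.items()}
        return self

    def predict(self, doc):
        tally = Counter(
            self.token_label[t] for t in doc.split() if t in self.token_label
        )
        return tally.most_common(1)[0][0] if tally else self.default


def accuracy(model, docs, labels):
    """Fraction of documents in the block that the model labels correctly."""
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


class BlockEnsemble:
    """Accuracy-weighted ensemble over stream blocks (illustrative only)."""

    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members = []  # each entry is [model, weight]

    def update(self, docs, labels):
        # Use the newest block as the validation set for existing members.
        for member in self.members:
            member[1] = accuracy(member[0], docs, labels)
        # Assumption: the model trained on the newest block gets weight 1.0.
        self.members.append([MajorityLearner().fit(docs, labels), 1.0])
        # Keep only the highest-weighted members.
        self.members.sort(key=lambda m: m[1], reverse=True)
        del self.members[self.max_members:]

    def predict(self, doc):
        # Weighted vote: each member adds its weight to its predicted label.
        scores = Counter()
        for model, weight in self.members:
            scores[model.predict(doc)] += weight
        return scores.most_common(1)[0][0]
```

In this sketch, members trained before a concept drift lose weight as soon as they misclassify the newest block, so the ensemble's vote shifts toward models that fit the recent data.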
Keywords/Search Tags: cluster tree, textual data stream, concept drift, dynamic ensemble method