Font Size: a A A

Vertical total variation for developing a scalable nearest neighbor classifier

Posted on:2007-04-21Degree:Ph.DType:Dissertation
University:North Dakota State UniversityCandidate:Abidin, Taufik FuadiFull Text:PDF
GTID:1458390005483449Subject:Computer Science
Abstract/Summary:
Recent advances in computer power, network, information storage, and multimedia have led to a proliferation of stored data in various domains, such as bioinformatics, image analysis, the World Wide Web, networking, banking, and retailing. This explosive growth of data has opened the need for developing efficient and scalable data-mining techniques that are capable of processing and analyzing large datasets. In data mining, classification is one of the important functionalities. Classification involves predicting the class label of newly encountered objects using feature attributes of a set of pre-classified objects. The classification result can be used to understand the existing objects in the dataset and to understand how new objects are grouped. In this dissertation, we focus our work on classification, more precisely on a scalable classification algorithm. We propose an efficient and scalable nearest neighbor classification algorithm that efficiently filters the candidates of neighbors by creating a total variation contour around the unclassified object. The objects within the contour are considered as the superset of nearest neighbors. These neighbors are identified efficiently using P-tree range query algorithm without having to scan the total variation values of the training objects one by one. The proposed algorithm further prunes the neighbor set by means of dimensional projections. After pruning, the k-nearest neighbors are searched from the pruned neighbor set. The proposed algorithm uses P-tree vertical data structure, one choice of vertical representation that has been experimentally proven to address the curse of scalability and to facilitate efficient data mining over large datasets. An efficient and scalable Vertical Set Squared Distance (VSSD) is used to compute total variation of a set of objects about a given object. The efficiency and scalability of the proposed algorithm are demonstrated empirically through experimentation using both real-world and synthetic datasets. The application of the proposed algorithm in image categorization is also discussed. Finally, the step-by-step integration of the proposed algorithm into DataMIME(TM) as a prototype of a new nearest neighbor classification algorithm that uses P-tree technology is also reported.
Keywords/Search Tags:Nearest neighbor, Total variation, Data, Algorithm, Scalable, Classification, Vertical
Related items