Vertical total variation for developing a scalable nearest neighbor classifier

Posted on:2007-04-21

Degree:Ph.D

Type:Dissertation

University:North Dakota State University

Candidate:Abidin, Taufik Fuadi

Full Text:PDF

GTID:1458390005483449

Subject:Computer Science

Abstract/Summary:

Recent advances in computer power, network, information storage, and multimedia have led to a proliferation of stored data in various domains, such as bioinformatics, image analysis, the World Wide Web, networking, banking, and retailing. This explosive growth of data has opened the need for developing efficient and scalable data-mining techniques that are capable of processing and analyzing large datasets. In data mining, classification is one of the important functionalities. Classification involves predicting the class label of newly encountered objects using feature attributes of a set of pre-classified objects. The classification result can be used to understand the existing objects in the dataset and to understand how new objects are grouped. In this dissertation, we focus our work on classification, more precisely on a scalable classification algorithm. We propose an efficient and scalable nearest neighbor classification algorithm that efficiently filters the candidates of neighbors by creating a total variation contour around the unclassified object. The objects within the contour are considered as the superset of nearest neighbors. These neighbors are identified efficiently using P-tree range query algorithm without having to scan the total variation values of the training objects one by one. The proposed algorithm further prunes the neighbor set by means of dimensional projections. After pruning, the k-nearest neighbors are searched from the pruned neighbor set. The proposed algorithm uses P-tree vertical data structure, one choice of vertical representation that has been experimentally proven to address the curse of scalability and to facilitate efficient data mining over large datasets. An efficient and scalable Vertical Set Squared Distance (VSSD) is used to compute total variation of a set of objects about a given object. The efficiency and scalability of the proposed algorithm are demonstrated empirically through experimentation using both real-world and synthetic datasets. The application of the proposed algorithm in image categorization is also discussed. Finally, the step-by-step integration of the proposed algorithm into DataMIME(TM) as a prototype of a new nearest neighbor classification algorithm that uses P-tree technology is also reported.

Keywords/Search Tags:

Nearest neighbor, Total variation, Data, Algorithm, Scalable, Classification, Vertical

Related items

1	Study On Generalized Nearest Neighbor Pattern Classification
2	Nearest Neighbor Classification Improved Algorithm
3	Vertical equal-interval neighborhood ring-based k-nearest neighbor/local support vector machine classification and applications
4	Study On Classification Algorithm Based On Natural Nearest Neighbor
5	Research Of Nearest Neighbor Classification Algorithm Based On Sample Selection
6	Mining Research, Based On The Integration Algorithm Of The K-nearest Neighbor Classification
7	Research On Several Pattern Classification Methods Based On K-nearest Neighbor Criterion
8	Classification Of Uncertain Data Based On Nearest Neighbor
9	Study On Improved LMS-KNN Nearest Neighbor Classification Method
10	Random K-Nearest Neighbor Algorithm With Application To Bankruptcy Prediction