Design and Implementation of a Scalable Core API for a Text Classification System

Posted on: 2004-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: J Di
Full Text: PDF
GTID: 2168360095453130
Subject: Computer applications
Abstract/Summary:
Data mining is one of the research focuses and frontiers of modern database research and has proved to be a multidisciplinary field, in which text mining and text classification are among the most active application areas. Text mining in Chinese raises special problems that stem from the characteristics of the Chinese language. To meet the specific requirements of an application, the author designs and implements the core model of a text mining system named TextMiner on the Java platform, providing a rational framework and many ready-to-use implementations for high-load Chinese text classification.

The main contributions of this article include:

(1) Summarizing the object-oriented analysis and design approaches used in developing the core TextMiner model, as well as the thinking behind its API programming. The article discusses how to improve extensibility, flexibility, and pluggability from the design stage onward, and illustrates some important design patterns with examples.

(2) Describing how to rationally model the key steps of text mining or text classification. The resulting model turns out to be more reusable than many similar systems. Because TextMiner uses XML technology for data binding, the reusability of temporary data is greatly improved.

(3) Giving innovative design tips and concrete methods: for example, a "filter chain" for preprocessing text materials; "semi-marshal" and "semi-unmarshal" for serializing and recovering large-scale objects; and two main mechanisms, "inclusive" and "exclusive", for feature selection. These new concepts offer practical experience for later similar systems.

(4) Because the data to be handled is enormous in volume, noisy, and full of ambiguity, TextMiner adopts data structures such as hash sets and cross indexing, along with other redundancy measures, as a trade-off in favor of time performance. This article discusses some of them.

(5) Presenting many implementations of text preprocessing, feature selection, and classifiers. A qualitative analysis of how to select or combine these separate methods is given near the end.

This article is organized as follows. The first chapter briefly introduces the new discipline of data mining, discusses the necessity of Chinese text mining, and emphasizes the challenges posed by Chinese text classification. An overview of the TextMiner system is presented in response to these special problems. Chapter 2 describes the main structure of the framework from the perspective of a text classification system and, from the software design perspective, presents some of the main design patterns and innovative solutions for the core model by example. The third chapter gives the implementation of the packages in detail and, at the same time, puts forward the related theories. The last chapter analyzes the combined use of the separate concrete methods and looks ahead to ongoing work.
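The "filter chain" idea named in contribution (3) can be illustrated with a minimal Java sketch. This is not TextMiner's actual API, which the abstract does not show: the names `TextFilter`, `FilterChain`, and `defaultChain`, and the three example filters, are all assumptions made for illustration. The sketch only shows the general pattern of passing text through an ordered, pluggable sequence of preprocessing steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FilterChainSketch {

    /** Hypothetical filter interface: takes text, returns transformed text. */
    interface TextFilter {
        String apply(String text);
    }

    /**
     * Applies filters in the order they were added. The chain is itself
     * a TextFilter, so chains can be nested inside other chains.
     */
    static final class FilterChain implements TextFilter {
        private final List<TextFilter> filters = new ArrayList<>();

        FilterChain add(TextFilter filter) {
            filters.add(filter);
            return this; // allow fluent chaining of add() calls
        }

        @Override
        public String apply(String text) {
            String current = text;
            for (TextFilter filter : filters) {
                current = filter.apply(current);
            }
            return current;
        }
    }

    /** Example chain; the individual filters are illustrative assumptions. */
    static FilterChain defaultChain() {
        return new FilterChain()
            .add(t -> t.replaceAll("<[^>]*>", " "))      // replace tag-like markup with a space
            .add(t -> t.toLowerCase(Locale.ROOT))        // normalize case
            .add(t -> t.trim().replaceAll("\\s+", " ")); // collapse whitespace
    }

    public static void main(String[] args) {
        // prints "text mining"
        System.out.println(defaultChain().apply("  <b>Text</b>   Mining  "));
    }
}
```

Because each step sits behind the same small interface, a filter can be added, removed, or reordered without touching the others, which is one plausible reading of the pluggability goal described in contribution (1).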
Keywords/Search Tags: data mining, text classification, object-oriented design, feature selection, classifier