A semantic partition based text mining model for document classification

Posted on: 2007-02-22
Degree: M.Sc
Type: Thesis
University: University of Windsor (Canada)
Candidate: Inibhunu, Catherine
Full Text: PDF
GTID: 2458390005488138
Subject: Computer Science
Abstract/Summary:
Feature extraction is a mechanism for extracting key phrases from text documents. Extraction can be weighted, ranking based, or semantic based. Weighted and ranking based feature extraction normally assigns scores to extracted words using various heuristics, and the highest scoring words are treated as important. Semantic based extraction tries to understand word meanings, and words with a higher orientation relative to the document context are picked as key features. Weighted and ranking based approaches are used to create document summaries that can stand in for the original documents when those are unavailable. However, these two approaches suffer from some major drawbacks: (1) the generated summaries can contain words that seem irrelevant to the document context; (2) sentences containing key words can be eliminated if they rank below a given threshold; (3) summaries must be processed further before they can serve as input to mining algorithms such as Apriori.

This thesis proposes the Semantic Partitions (SEM-P) and Enhanced Semantic Partitions (ESEM-P) algorithms, which are based on the semantic orientation of words in a document. The partitioning reduces the number of words required to represent each document as input for discovering frequent word patterns in a collection of documents, while still maintaining the semantics of the documents. ESEM-P applies a weighting and ranking heuristic measure to each term in a partition to prune low ranked terms, which improves its performance over SEM-P. The identified frequent word patterns are used to generate a document classification model.

Keywords: Text mining, text information mining, unstructured data mining, feature extraction, semantic orientation, text classification, semantic partitions, text summarization.
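The general pipeline the abstract describes (rank terms, prune the low ranked ones, then mine frequent word patterns across documents) can be illustrated with a small sketch. This is not the SEM-P or ESEM-P algorithm, whose details the abstract does not give. It assumes raw term frequency as a stand-in for the ranking heuristic and counts only co-occurring word pairs rather than running a full Apriori search. The function names `top_terms` and `frequent_pairs` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_terms(tokens, keep):
    """Rank a document's terms by raw frequency and keep the `keep` highest.

    Raw frequency is an assumed stand-in for a ranking heuristic.
    """
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return set(ranked[:keep])


def frequent_pairs(doc_term_sets, min_support):
    """Return word pairs that co-occur in at least `min_support` documents."""
    pair_counts = Counter()
    for terms in doc_term_sets:
        # Sorting gives each pair one canonical (alphabetical) ordering.
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical toy collection, already tokenized.
docs = [
    "text mining text mining model".split(),
    "text mining feature extraction text mining".split(),
    "semantic feature extraction semantic feature".split(),
]

# Step 1: reduce each document to its two highest ranked terms.
reduced = [top_terms(d, keep=2) for d in docs]

# Step 2: find word pairs shared by at least two reduced documents.
patterns = frequent_pairs(reduced, min_support=2)
print(patterns)
```

Pruning before pattern mining shrinks the number of candidate pairs each document contributes, which is the kind of input reduction the abstract attributes to partitioning. A classifier could then use the surviving patterns as features.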
Keywords/Search Tags:Document, Text, Semantic, Used, Mining, Feature extraction, ESEM-P, Word