A semantic partition based text mining model for document classification

Posted on:2007-02-22

Degree:M.Sc

Type:Thesis

University:University of Windsor (Canada)

Candidate:Inibhunu, Catherine

GTID:2458390005488138

Subject:Computer Science

Abstract/Summary:

Feature extraction is a mechanism for extracting key phrases from a given text document. The extraction can be weighted, ranked, or semantic based. Weighted and ranking based feature extraction normally assigns scores to extracted words using various heuristics, and the highest scoring words are treated as important. Semantic based extraction normally tries to understand word meanings, and words with a higher orientation in the document's context are picked as key features. Weighted and ranking based approaches are used to create document summaries that can represent the original documents in their absence. However, these two approaches suffer from some major drawbacks: (1) the generated summaries can contain words that seem irrelevant to the document context, (2) sentences containing some key words can be eliminated if they rank below a given threshold, and (3) summaries must be processed further before they can serve as input for mining algorithms such as Apriori.

This thesis proposes the Semantic Partitions (SEM-P) and Enhanced Semantic Partitions (ESEM-P) algorithms, which are based on the semantic orientation of words in a document. The partitioning reduces the number of words required to represent each document as input for discovering frequent word patterns in a collection of documents, while still maintaining the semantics of the documents. In ESEM-P, a weighting and ranking heuristic measure for each term in a partition is used to prune low ranked terms, which improves the performance of ESEM-P over SEM-P. The identified frequent word patterns are used to generate a document classification model.

Keywords: Text mining, text information mining, unstructured data mining, feature extraction, semantic orientation, text classification, semantic partitions, text summarization.
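
The general pipeline described in the abstract (score the terms of each document, prune the low ranked ones, then look for word patterns that recur across documents) can be illustrated with a minimal sketch. This is not the SEM-P or ESEM-P algorithm from the thesis: the function names `top_terms` and `frequent_pairs`, the raw-frequency score, the small stopword list, and the restriction to word pairs are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "is", "in", "to"})


def top_terms(text, keep=3):
    """Return the `keep` highest scoring terms of one document.

    The score here is plain term frequency (an assumption); ties are
    broken alphabetically so the result is deterministic.
    """
    words = [w.strip(".,;:()").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:keep]}


def frequent_pairs(term_sets, min_support=2):
    """Count term pairs that co-occur in at least `min_support` documents.

    This covers only 2-item patterns, a simplification of Apriori-style
    frequent pattern mining.
    """
    pair_counts = Counter()
    for terms in term_sets:
        pair_counts.update(combinations(sorted(terms), 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


if __name__ == "__main__":
    docs = [
        "text mining text mining patterns",
        "mining text mining text documents",
        "semantic partitions semantic partitions text",
    ]
    reduced = [top_terms(d, keep=2) for d in docs]
    print(frequent_pairs(reduced, min_support=2))
```

Pruning each document to a few terms before pattern counting keeps the number of candidate pairs small, which is the kind of input reduction the abstract motivates. A semantic approach would choose the retained terms differently.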

Keywords/Search Tags:

Document, Text, Semantic, Used, Mining, Feature extraction, ESEM-P, Word

Related items

1	Research On Text Representation And Feature Extraction Methods Based On Conditional Co-occurrence Degree
2	Research On Feature Word Extraction Of APP Based On User's Comments
3	Using Word Embedding And Text Feature For Event Extraction
4	Research On Key Problems In WEB Text Mining
5	A semantic graph model for text representation and matching in document mining
6	Research On Chinese Text Classification Based On Semantic Analysis
7	Research Of Text Mining Based On Semantic Analysis
8	Study On Feature Word Extraction And Semantic Orientation Analysis In Chinese Opinion Mining
9	Semantic Feature Extraction Algorithm, The Contents Of Text Classification
10	Text Understanding Based On Semantic Relevance Under Internet Environment