Incorporating semantic and syntactic information into document representation for document clustering

Posted on:2006-01-10

Degree:Ph.D

Type:Dissertation

University:Mississippi State University

Candidate:Wang, Yong

GTID:1458390008958407

Subject:Computer Science

Abstract/Summary:

Document clustering is a widely used strategy in information retrieval and text data mining. Traditional document clustering systems represent each document as a bag of independent words. In this project, we propose to enrich the document representation with semantic and syntactic information, which we identify by performing semantic and syntactic analysis on the raw text. We provide a detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis. Our experimental results demonstrate that incorporating semantic and syntactic information improves the performance of our document clustering system on most of our data sets. A statistically significant improvement is achieved when both syntactic and semantic information are combined. Our experiments with compound words show that using compound words alone does not improve clustering performance on our data sets. When compound words are combined with the original single words, the combined feature set performs slightly better on most data sets, but this improvement is not statistically significant. To select the best clustering algorithm for our document clustering system, we compare several widely used clustering algorithms. Although the bisecting K-means method has advantages when working with large data sets, a traditional hierarchical clustering algorithm still achieves the best performance on our small data sets.
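
The combined feature set described in the abstract (single words plus compound words, clustered hierarchically) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes whitespace tokenization, treats every adjacent word pair as a crude stand-in for a detected compound word, and uses average-link agglomerative clustering over cosine similarity. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def features(text, use_compounds=True):
    """Bag of single words, optionally extended with adjacent word pairs."""
    words = text.lower().split()
    bag = Counter(words)
    if use_compounds:
        # Assumption: adjacent pairs stand in for real compound-word detection.
        bag.update(f"{a}_{b}" for a, b in zip(words, words[1:]))
    return bag


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerative(vectors, k):
    """Average-link hierarchical clustering; returns k lists of document indices."""
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find and merge the most similar pair of clusters.
        _, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] += clusters.pop(j)
    return clusters


# Hypothetical sample documents, for illustration only.
docs = [
    "document clustering groups similar documents",
    "hierarchical clustering groups documents",
    "syntactic parsing builds parse trees",
    "semantic parsing builds meaning trees",
]
vectors = [features(d) for d in docs]
clusters = agglomerative(vectors, 2)
print(clusters)
```

Calling `features(text, use_compounds=False)` gives the plain bag-of-words baseline, so the two representations can be compared on the same documents.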

Keywords/Search Tags:

Clustering, Information, Incorporating semantic, Data, Performance

Related items

1	Incorporating physical information into clustering for FPGAs
2	Research On Recommending Approach Based On Semantic Analysis For RSS Web Information Services
3	Research On Deep Text Clustering Method Based On Semantic Information Enhancemen
4	Research On Clustering Algorithm For Web Document By Incorporating Distribution Information
5	Research On Ontology-Based Semantic Information Retrieval
6	Research Of Incorporating Side Information Into Multivariate IB Method For Multi-view Clustering
7	Incorporating background knowledge in document clustering
8	Design And Implementation Of Semantic Search Subsystem For Personal Data On Mobile Phone
9	Research On Semantic Processing Technology Based Information Retrieval Model