Document Clustering Method Based On WAF

Posted on:2014-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:X Mo

Full Text:PDF

GTID:2248330398972216

Subject:Signal and Information Processing

Abstract/Summary:


Clustering is an important statistical, unsupervised method in information processing and a foundation for many application fields. Document clustering aims to partition documents into clusters. In information retrieval, it speeds up search and improves retrieval quality.

Document clustering has been studied for several decades. In recent years, the WAF model, which builds on the co-occurrence of terms, has shown remarkable results in dataset statistics and term relation analysis. WAF can also represent a document: compared with the VSM, a WAF model carries more information about the document, which probably makes it an effective document representation model.

This thesis studies document clustering based on the WAF model in the following aspects.

First, the meanings and theorems of WAF are analyzed and derived. On one hand, the geometric significance of WAF is analyzed from the perspective of graphs. On the other hand, the physical significance of the WAF model is derived from the perspective of language models and information theory.

Second, WAF is improved as a document model. To adapt the WAF model to document modeling, a similarity between WAF-based document models is defined. Furthermore, smoothing techniques are introduced for the WAF model.

Third, WAF-based clustering is evaluated experimentally on English Wikipedia documents. The VSM is used as a control to evaluate the clustering performance of the WAF-based model and to demonstrate the effectiveness of WAF as a document representation model.

Last, the clustering and storage methods used in a real project, whose input is large-scale short-text data, are introduced. A fast clustering algorithm is adopted to handle the large data volume, with the aim of reducing the workload of downstream modules. AIS is proposed to store the time-series big data stream.
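The contrast the abstract draws between the VSM and a co-occurrence-based model can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term counts for the VSM side, and it uses plain windowed pair counts on the co-occurrence side. The helper names `tf_vector`, `cosine`, and `cooccurrence` and the `window` parameter are hypothetical, and the pair counts are not the WAF formula, which this record does not give.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts: the VSM view of a document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cooccurrence(text, window=2):
    """Ordered term-pair counts within a forward window.

    Illustrative only: plain co-occurrence counts, not the WAF statistic.
    """
    tokens = text.lower().split()
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs[(left, right)] += 1
    return pairs


doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

# Same words, different order: the VSM vectors coincide,
# while the co-occurrence counts differ.
vsm_similarity = cosine(tf_vector(doc_a), tf_vector(doc_b))
same_pairs = cooccurrence(doc_a) == cooccurrence(doc_b)
```

In this toy case the two documents are indistinguishable to the bag-of-words representation, while their pair counts differ, which is one sense in which a co-occurrence-based representation may retain more information about a document.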

Keywords/Search Tags:

WAF, document clustering, document similarity, smoothing


Related items

1	Research On Semantic Similarity Computation And Applications
2	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
3	Research Of XML Document Clustering
4	The Research Of Enterprise Document Retrieval Model Based On Ontology
5	Effects of similarity metrics on document clustering
6	Research On Cross-language Document Sorting Learning Method Based On Bilingual Document Similarity
7	Clustering Research Of XML Document
8	Web Document Automatic Classification Based On Keywords
9	Application Of Document Similarity Detection In Enterprise Document Leakage Prevention
10	Study Of Document Organization Method Based On Topic Map