Clustering is an important statistics-based unsupervised method in information processing and a key foundation for many application fields. Document clustering aims to partition documents into clusters. In information retrieval, it speeds up search and improves retrieval effectiveness. Document clustering has been studied for several decades. In recent years, the WAF model, which builds a model from the co-occurrence of terms, has shown remarkable results in dataset statistics and term relation analysis. WAF can also represent a document: compared with the VSM, the WAF model carries more information about the document, which probably makes it one of the more effective document representation models.

In this paper, document clustering based on the WAF model is studied in the following aspects.

First, the meanings and theorems of WAF are analyzed and deduced. On the one hand, the geometric significance of WAF is analyzed from the perspective of graphs. On the other hand, the physical significance of the WAF model is deduced from the perspective of language models and information theory.

Second, WAF is improved as a document model. To adapt the WAF model for document modeling, a similarity measure between different WAF-based document models is defined. Furthermore, smoothing techniques are introduced for the WAF model.

Third, WAF-based clustering is tested experimentally on English Wikipedia documents. The VSM is used as a control to evaluate the clustering performance of the WAF-based model and to demonstrate the effectiveness of WAF as a document representation model.

Last, the clustering and storage methods used in a real project, whose input is a large volume of short texts, are introduced. A fast clustering algorithm is adopted to handle the big data, with the aim of reducing the workload of downstream modules. AIS is proposed to store big-data time-series streams.