Research On Parallel Non-Intervention Document Clustering Algorithm

Posted on: 2011-10-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J F Yang
Full Text: PDF
GTID: 1118330332982874
Subject: Communication and Information System
Abstract/Summary:
With the growing informatization of society, the number of information records that exist as documents keeps increasing; they accumulate faster and fall into ever more classes. Using these documents efficiently is a current research focus, and document clustering analysis is an important and efficient method for doing so. A document clustering algorithm classifies documents based on their attributes: documents in the same class have high similarity, and documents in different classes have low similarity. This thesis focuses on the key technologies of large-scale document clustering. It proposes a parallel non-intervention document clustering algorithm, which comprises keyword extraction, initial seed selection, the document clustering algorithm itself, and its parallelization.

First, the thesis reviews current research on the key technologies of document clustering and analyzes the typical achievements and their contributions. The work in this thesis builds on that review.

Next, the thesis describes a keyword extraction method based on a distance factor and an initial seed selection method based on the single shortest path graph (SSPG). The Distance Factor is added to the traditional TF-IDF equation. It is used to screen and sort the feature items and to select the most suitable candidates to represent the documents. On this basis, the words are clustered and representative words are chosen as keywords. After keyword extraction, the SSPG-based initial seed selection algorithm is proposed. It selects initial seeds within high-density regions and disperses them across the data space, so that suitable initial clustering seeds are obtained.

The thesis then proposes SLPPCA, a document clustering algorithm based on the Stem-Leaf-Point Plot (SLPP). The features of the Stem-Leaf Plot (SLP) are analyzed first, and the "leaf-point" is then added to construct the SLPP. Before document clustering, the SLPP is constructed for the data space and a preliminary classification of the data objects is built. On this basis, the boundary point set is defined by finding the boundary points, and the internal points are identified as well. Finally, clustering is carried out based on the boundary point set and the internal point set. Because SLPP-based clustering pre-processes the data set and classifies the data according to their similarities and differences, SLPPCA can calculate the number of clusters and complete the clustering task without intervention.

Finally, the thesis studies SLPPCA-based document clustering in a parallel processing environment. It analyzes the background of new technologies such as multicore processors and the features of the multi-thread model, and then presents a method for parallelizing SLPPCA. The steps of serial SLPPCA are analyzed for parallel decomposition, and the parallelization method, which maps the parallelism of SLPPCA onto multiple threads for parallel optimization, is described. With the parallel multi-threaded implementation, SLPPCA clusters documents in parallel, which speeds up its execution, takes full advantage of the resources offered by the new hardware, and achieves highly efficient document clustering.

This work focuses on document clustering. The main contributions of the thesis are: 1) it proposes the keyword extraction method; 2) it designs the SSPG-based initial seed selection algorithm; 3) it proposes SLPPCA; and 4) it studies parallel multi-threaded optimization for SLPPCA. The theory in this thesis was verified through experiments. The experimental results show that the approach can process large-scale document datasets in parallel and improves both the efficiency and the quality of document clustering.
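
As a rough illustration of the plot that SLPP starts from, the following Python sketch builds a classic stem-and-leaf plot. It is not the thesis's algorithm: the abstract does not define the "leaf-point" extension or how documents are mapped onto the plot, so the function names `stem_leaf_plot` and `render` and the integer scores are assumptions made purely for this example.

```python
from collections import defaultdict


def stem_leaf_plot(values):
    """Group non-negative integers by stem (all digits but the last) and leaf (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


def render(plot):
    """Return one text row per stem, with the stem left of the bar and its leaves to the right."""
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(plot.items())
    ]


# Hypothetical integer scores, for illustration only (not data from the thesis).
scores = [12, 15, 21, 27, 29, 34, 12, 48]
for row in render(stem_leaf_plot(scores)):
    print(row)
```

Values that share a stem end up in the same row, which gives the kind of coarse preliminary grouping that the abstract attributes to the plot before boundary and internal points are determined.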
Keywords/Search Tags: clustering analysis, document clustering algorithm, feature item selection, SLPP, parallel computing