
A Distributed Indexing Method Of Large Scale Document Set Based On Clustering

Posted on: 2017-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: W L Wang
GTID: 2308330485482226
Subject: Computer Science and Technology
Abstract/Summary:
Full-text retrieval is one of the most useful ways to obtain accurate information in the era of big data, and index management is its most important component. At big-data scale, managing a centralized index becomes very difficult, and one of the most practical remedies is to build a distributed index. The key problem in a distributed index is how to split the index. The two usual approaches, splitting by term (word) and splitting by document, each have advantages and disadvantages.

This thesis studies techniques for distributed indexing. Building on prior work, we propose a method for building a distributed index based on an optimized clustering algorithm. The method partitions the document set into several clusters with an optimized k-means algorithm and then creates a local index for each cluster. Its advantages include load balancing and low network transmission overhead, and it avoids searching every local index for each query. We optimize and parallelize the k-means algorithm and use it to partition the documents. As a result, the system becomes more efficient, more stable, and better balanced.

We first studied the common text clustering methods and found that most optimized variants require substantial computing resources, which makes them a poor fit for big-data environments. We therefore propose an optimized k-means algorithm based on sample clustering, which we call SCB-K-means. It improves on prior work by choosing the initial points from a clustering of a sample of the data, which effectively improves the clustering result.

Finally, we implemented a parallel SCB-K-means algorithm on the Hadoop framework, using HDFS and the MapReduce programming model, and used it to create a distributed index of a large dataset. Experiments indicate that the method performs well in both efficiency and search results.
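
The abstract does not give the details of SCB-K-means, so the following is only a minimal single-machine Python sketch of the general idea, not the thesis's implementation. It assumes that "sample clustering" means running k-means on a random sample to obtain seed centroids for the full run, that documents are represented as raw term-count vectors, and that a query is routed to the single nearest cluster. All function names (`sample_seeded_kmeans`, `build_local_indexes`, `search`) are hypothetical, and the Hadoop/MapReduce parallelization is omitted.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def assign(points, centroids):
    """Index of the nearest centroid for every point (ties go to the lowest index)."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, centroids, iters=20):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iters):
        groups = defaultdict(list)
        for p, lab in zip(points, assign(points, centroids)):
            groups[lab].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(groups[j]) if j in groups else centroids[j]
                   for j in range(len(centroids))]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def sample_seeded_kmeans(points, k, sample_size, seed=0):
    """Assumed reading of sample-based seeding: cluster a random sample
    first, then use its centroids as initial points for the full data."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    seeds, _ = kmeans(sample, sample[:k])
    return kmeans(points, seeds)


def vectorize(text, vocab):
    """Raw term-count vector over a fixed vocabulary (an assumption)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def build_local_indexes(docs, labels):
    """One inverted index (term -> set of doc ids) per cluster."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, (text, lab) in enumerate(zip(docs, labels)):
        for term in set(text.lower().split()):
            indexes[lab][term].add(doc_id)
    return indexes


def search(query, vocab, centroids, indexes):
    """Route the query to its nearest cluster and intersect postings there,
    so the other local indexes are never touched."""
    target = assign([vectorize(query, vocab)], centroids)[0]
    postings = [indexes[target].get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

To try it, build a vocabulary from the documents, vectorize them, call `sample_seeded_kmeans` to get centroids and labels, pass the labels to `build_local_indexes`, and query with `search`. Routing to one cluster trades some recall for less work per query; a real system might probe the few nearest clusters instead.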
Keywords/Search Tags:Distributed index, Document Clustering, K-means algorithm, MapReduce