A Distributed Indexing Method Of Large Scale Document Set Based On Clustering

Posted on:2017-04-02

Degree:Master

Type:Thesis

Country:China

Candidate:W L Wang

Full Text:PDF

GTID:2308330485482226

Subject:Computer Science and Technology

Abstract/Summary:

Full-text retrieval is one of the most useful ways to obtain accurate information in the era of big data, and index management is its most important component. At big-data scale, managing a centralized index becomes a serious challenge, and one of the most practical solutions is to build a distributed index. The key problem in a distributed index is how to split the index. The two common splitting approaches, term-based (by word) and document-based, each have advantages and disadvantages.

In this thesis, we study the techniques related to distributed indexing. Building on earlier work, we propose a method for building a distributed index based on an optimized clustering algorithm. The method partitions the document set into several clusters with an optimized k-means algorithm and then creates a local index for each cluster. Its advantages include load balancing and low network transmission overhead, and it avoids searching every local index for each query. We optimize and parallelize the k-means algorithm and use it to partition the documents, which improves the efficiency and performance of the system and makes it more stable and better balanced.

We first study common text clustering methods. We find that most optimized methods require substantial computing resources and are therefore poorly suited to a big-data environment. Building on earlier work, we propose an optimized k-means algorithm based on sample clustering, which we name SCB-K-means. The algorithm improves on the earlier work by choosing the initial points through clustering a sample, which effectively improves the clustering result.

Finally, we implement a parallel version of SCB-K-means on the Hadoop framework, using HDFS and the MapReduce programming model, and use it to build a distributed index over a large dataset. Experiments show that the proposed method performs well in both efficiency and search results.
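
The abstract does not spell out the details of SCB-K-means, so the following Python sketch is only one plausible reading of the pipeline it describes, not the author's implementation. It assumes that documents are already represented as numeric vectors, that the initial centroids come from running k-means on a small random sample, and that each iteration can be phrased as a map step (assign each vector to its nearest centroid) followed by a reduce step (average each cluster). All function names (`sample_seeded_kmeans`, `build_local_indexes`, `search`, and so on) are hypothetical, and everything runs in a single process rather than on Hadoop.

```python
import random
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    """Index of the centroid closest to vec (ties go to the lower index)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))

def map_assign(vectors, centroids):
    """Map step: emit (cluster_id, vector) for every vector."""
    return [(nearest(v, centroids), v) for v in vectors]

def reduce_update(pairs, centroids):
    """Reduce step: average each cluster; an empty cluster keeps its old centroid."""
    groups = defaultdict(list)
    for cid, v in pairs:
        groups[cid].append(v)
    return [
        tuple(sum(col) / len(groups[i]) for col in zip(*groups[i])) if groups[i] else c
        for i, c in enumerate(centroids)
    ]

def kmeans(vectors, centroids, max_iter=50):
    """Plain k-means from the given initial centroids, one map/reduce round per iteration."""
    for _ in range(max_iter):
        updated = reduce_update(map_assign(vectors, centroids), centroids)
        if updated == centroids:  # assignments no longer move the centroids
            break
        centroids = updated
    return centroids

def sample_seeded_kmeans(vectors, k, sample_size, seed=0):
    """Assumed reading of sample-based seeding: cluster a small random
    sample first, then use its centroids as initial points for the full run."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    seeds = kmeans(sample, rng.sample(sample, k))
    return kmeans(vectors, seeds)

def build_local_indexes(docs, centroids):
    """docs: iterable of (doc_id, vector, terms).
    Returns one inverted index (term -> set of doc ids) per cluster."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, vec, terms in docs:
        local = indexes[nearest(vec, centroids)]
        for term in terms:
            local[term].add(doc_id)
    return indexes

def search(term, query_vec, indexes, centroids):
    """Route the query to the nearest cluster and consult only that local index."""
    cid = nearest(query_vec, centroids)
    return sorted(indexes.get(cid, {}).get(term, set()))
```

In this sketch a query touches a single local index, which mirrors the abstract's point about not searching every local index. Whether the thesis routes a query to one cluster or to several is not stated, so the single-cluster routing here is an assumption as well.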

Keywords/Search Tags:

Distributed index, Document Clustering, K-means algorithm, MapReduce

Related items

1	Parallel Clustering Algorithm Based On MapReduce
2	Research On Distributed Clustering Algorithm Based On MapReduce
3	Improved K-means Clustering Algorithm Based On MapReduce Framework
4	Research Of K-means Clustering Algorithm Based On MapReduce
5	Research On Parallelization Of K-Means Clustering Algorithm Based On MapReduce
6	Research On K-Means Algorithm Based On MapReduce
7	Research On Cluster Center Optimization Of K-means Algorithm
8	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
9	Research On Document Clustering Algorithm Based On K-means
10	Research On Parallelization Of Clustering Algorithm Based On Mapreduce