Research On Web Clustering Algorithm Based On Chinese Retrieval

Posted on:2017-01-10

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Tian

Full Text:PDF

GTID:2308330503979773

Subject:Computer Science and Technology

Abstract/Summary:

With the advent of the big data era, users demand greater accuracy and efficiency from information search, but traditional search engines have some drawbacks. In clustering search, the search engine clusters the returned results, extracts a label for each cluster, and presents the labels and clusters to the user. From the labels, the user can grasp the overall content of the retrieval results directly and then quickly locate the information of interest, which improves query efficiency.

Carrot2 is an open source clustering search engine system whose clustering algorithms are strongly representative of Web retrieval clustering. This paper studies the K-means and Lingo algorithms used in Carrot2 in depth and makes some improvements to them.

K-means is a classical partition clustering algorithm. It is simple and has low time complexity, but it also has shortcomings: the K value must be determined, the initial cluster centers must be selected, and the result is vulnerable to noisy data. This paper determines the K value and the initial cluster centers according to the characteristics of Web search results, and then uses a weighted method instead of the plain average to reduce the influence of noisy data. Repeated tests show that the resulting classification of documents is reasonable. Based on the ranking characteristics of Web search results, this paper also improves the weight calculation formula so that documents are placed in the top clusters. For example, in the search results for HUAWEI, the document "HUAWEI mobile phone Encyclopedia" could acceptably be placed in either the mobile phone cluster or the Encyclopedia cluster, but placing it in the mobile phone cluster is more reasonable.

Lingo is a clustering algorithm based on latent semantic indexing. This paper first analyzes the factors that affect the clustering quality of the Lingo algorithm and then improves its weight calculation formula.

The experiments show that the improved K-means algorithm addresses the document assignment problem of hard clustering algorithms, achieves good clustering quality, and produces highly readable labels. The accuracy of the improved Lingo clustering algorithm is also greatly improved.
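
The idea of replacing the plain average with a weighted update in K-means can be illustrated with a minimal sketch. The abstract does not give the thesis's actual weight formula, so everything here is an assumption made for illustration: the function names, the hypothetical `rank_weight` (which simply gives earlier search results more weight), and the use of small dense vectors in place of real document vectors.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def rank_weight(rank: int) -> float:
    """Hypothetical weight: results ranked earlier (smaller rank) count more."""
    return 1.0 / (1.0 + rank)


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_centroid(points: List[Vector], weights: List[float]) -> List[float]:
    """Weighted mean of the points; a low-weight outlier moves it only slightly."""
    total = sum(weights)
    dim = len(points[0])
    return [
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dim)
    ]


def weighted_kmeans(
    points: List[Vector],
    weights: List[float],
    centers: List[Vector],
    iterations: int = 10,
) -> Tuple[List[int], List[List[float]]]:
    """K-means whose update step uses a weighted mean instead of a plain mean."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: recompute each center as a weighted mean of its members.
        for k in range(len(centers)):
            members = [i for i, label in enumerate(labels) if label == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = weighted_centroid(
                    [points[i] for i in members],
                    [weights[i] for i in members],
                )
    return labels, centers


# Four toy "documents" listed in search-result order (rank 0 first).
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
doc_weights = [rank_weight(r) for r in range(len(docs))]
labels, centers = weighted_kmeans(docs, doc_weights, [docs[0], docs[2]])
print(labels, centers)
```

In this sketch each center is pulled toward the highly weighted documents in its cluster, so a low-ranked, possibly noisy result has less influence than it would under a plain average. The initial centers are passed in explicitly because the abstract says the thesis chooses K and the initial centers from the characteristics of the search results, without stating how.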

Keywords/Search Tags:

Web search, Clustering, K-means, Lingo, Feature weight

Related items

1	Based On The Text Of The K-means Clustering Analysis
2	Research On Clustering Systems Of Search Engine Results
3	The Design And Implementation Of Monitoring SMS System Based Lingo Algorithm
4	Clustering Methods And Applications For High-dimensional Data Based On K-harmonic Means
5	Study On Search Results Clustering Algorithm Based On Multi-Core Technology
6	Improvements And Implementation Of K-means Clustering Algorithm
7	Improved Parallel K-means Clustering Algorithm Based On Cuckoo Search
8	Research On Improved Non-local Means Image Denoising Algorithm
9	K-NN, K-means And The Application In Text Mining
10	The Research And Application Of Text Clustering Based On Improved K-means Algorithm