
Research On Web Clustering Algorithm Based On Chinese Retrieval

Posted on: 2017-01-10  Degree: Master  Type: Thesis
Country: China  Candidate: S Y Tian  Full Text: PDF
GTID: 2308330503979773  Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the big data era, users demand greater accuracy and efficiency from information search, but traditional search engines have some drawbacks. A clustering search engine clusters the results returned by a search engine, extracts a label for each cluster, and presents the labeled clusters to the user. From the labels, the user can grasp the overall content of the retrieval results directly and then quickly locate the information of interest, which improves query efficiency.
Carrot2 is an open source clustering search engine system whose clustering algorithms are representative of Web retrieval clustering. This paper studies two of the algorithms used in Carrot2, K-means and Lingo, in depth and makes some improvements to each.
K-means is a classical partition clustering algorithm. It is simple and has low time complexity, but it also has shortcomings: the value of K must be determined, the initial cluster centers must be selected, and the algorithm is vulnerable to noisy data. This paper determines the K value and the initial cluster centers according to the characteristics of Web search results, and then uses a weighted method instead of the average to reduce the influence of noisy data. Repeated tests show that documents are classified reasonably. Drawing on the ranking of Web search results, the paper also improves the weight calculation formula so that each document falls into its best-matching cluster. For example, in the search results for HUAWEI, the document "HUAWEI mobile phone Encyclopedia" could acceptably be placed in either the mobile phone cluster or the Encyclopedia cluster, but the mobile phone cluster is the more reasonable choice.
Lingo is a clustering algorithm based on latent semantic indexing. This paper first analyzes the factors that affect the clustering quality of the Lingo algorithm and then improves its weight calculation formula.
The experiments show that the improved K-means algorithm mitigates the document assignment problem of hard clustering, achieves a good clustering effect, and produces highly readable cluster labels. The accuracy of the improved Lingo clustering algorithm is also greatly improved.
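The idea of replacing the plain average with a weighted update in K-means can be illustrated with a minimal sketch. The abstract does not give the thesis's actual weight formula, so the `rank_weights` function below (weight 1 / (1 + rank)) is purely a hypothetical stand-in, as are all function names; the sketch only shows how rank-based weights could pull a cluster center toward higher-ranked documents and away from low-ranked, possibly noisy ones.

```python
from math import dist


def rank_weights(n):
    """Hypothetical weighting (an assumption, not the thesis's formula):
    the result at 0-based rank r gets weight 1 / (1 + r)."""
    return [1.0 / (1 + r) for r in range(n)]


def weighted_centroid(vectors, weights):
    """Weighted mean of the vectors: sum(w_i * x_i) / sum(w_i)."""
    total = sum(weights)
    dims = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dims)
    ]


def weighted_kmeans(vectors, weights, centers, iterations=10):
    """K-means whose update step uses a weighted centroid instead of the mean."""
    centers = [list(c) for c in centers]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: dist(v, centers[k]))
            for v in vectors
        ]
        # Update step: recompute each center as a weighted centroid.
        for k in range(len(centers)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = weighted_centroid(
                    [vectors[i] for i in members],
                    [weights[i] for i in members],
                )
    return labels, centers
```

In this sketch the document vectors are assumed to be listed in search-result order, so `rank_weights(len(vectors))` gives the first result the largest influence on its cluster center. The initial centers are passed in explicitly because the thesis selects them from the characteristics of the search results, a step not detailed in the abstract.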
Keywords/Search Tags:Web search, Clustering, K-means, Lingo, Feature weight