Research And Implementation Of Clustering Systems Of Web Search Results

Posted on: 2012-12-21
Degree: Master
Type: Thesis
Country: China
Candidate: D X Liu
GTID: 2178330335460304
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid spread of the Internet and the development of search engine technology, more and more people rely on search engines to find information. Although many search engines have worked hard to improve retrieval precision, their results still mix many irrelevant documents with the relevant ones, which makes it difficult for web users to locate the information they need. Clustering the retrieved results addresses this problem: the resulting groups should show a high degree of association between documents in the same group and a low degree between documents in different groups. Users can then browse only the groups that interest them, which saves considerable time.

After an in-depth study of feature extraction, feature weighting and document clustering algorithms, we use the TF-IDF algorithm to extract and weight features, and use the STC algorithm to explore the application of clustering to the retrieved results of search engines. The work focuses on the following points:

1) Building on an extended study of text preprocessing techniques, we implemented a multifunctional preprocessing module that covers retrieved-result acquisition, page denoising, word segmentation and stop-word removal.

2) TF-IDF weighting is the most commonly used weighting method in the VSM (vector space model). It raises the weight of words that occur frequently within a document and lowers the weight of words that occur frequently across the whole document collection. However, it ignores the impact of a word's POS (part of speech) and position, so we modified the TF-IDF formula by introducing a POS factor and a position factor. The experiments show that the modified formula improves the Macro_F1 and Micro_F1 of STC and enhances the performance of the clustering system.

3) After a thorough study of STC, we designed a comparison between Lingo, K-means and STC. According to the experimental results, STC is superior to the other two algorithms in clustering and label induction. The labels generated by STC are more appropriate, and STC performs well both in representing the information in the retrieved results and in time complexity.

Detailed analysis of the experimental data shows that the clustering system is effective and that the expected goal is achieved.
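The abstract does not give the modified formula, so the following is only a minimal sketch of how a POS factor and a position factor could be folded into TF-IDF. The multiplicative form, the factor tables `POS_FACTOR` and `POSITION_FACTOR`, their numeric values, the tag names and the function name `weighted_tfidf` are all assumptions made for illustration, not the thesis's actual method.

```python
import math
from collections import Counter

# Hypothetical factor values; the abstract does not state the real ones.
POS_FACTOR = {"noun": 1.5, "verb": 1.2}            # other tags default to 1.0
POSITION_FACTOR = {"title": 2.0, "snippet": 1.0}   # other positions default to 1.0


def weighted_tfidf(docs):
    """Compute TF-IDF weights scaled by assumed POS and position factors.

    docs: list of documents, each a list of (term, pos_tag, position) tuples.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update({term for term, _, _ in doc})

    weights = []
    for doc in docs:
        tf = Counter(term for term, _, _ in doc)

        # For each term, keep the largest POS * position factor
        # observed over its occurrences in this document.
        boost = {}
        for term, pos, where in doc:
            factor = POS_FACTOR.get(pos, 1.0) * POSITION_FACTOR.get(where, 1.0)
            boost[term] = max(boost.get(term, 0.0), factor)

        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term]) * boost[term]
            for term, count in tf.items()
        })
    return weights


if __name__ == "__main__":
    results = [
        [("apple", "noun", "title"), ("eat", "verb", "snippet")],
        [("apple", "noun", "snippet"), ("run", "verb", "snippet")],
    ]
    for doc_weights in weighted_tfidf(results):
        print(doc_weights)
```

In this sketch a term that appears in every document still receives weight zero, because the IDF term is zero regardless of the factors; whether the thesis's formula behaves the same way is not stated in the abstract.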
Keywords/Search Tags: search engines, retrieved results, clustering, TF-IDF, STC