Research And Implementation Of Clustering Systems Of Web Search Results

Posted on:2012-12-21

Degree:Master

Type:Thesis

Country:China

Candidate:D X Liu

Full Text:PDF

GTID:2178330335460304

Subject:Signal and Information Processing

Abstract/Summary:

With the rapid spread of the Internet and the development of search engine technology, more and more people rely on search engines to find information. Although many search engines have worked hard to improve retrieval precision, their results still mix many irrelevant documents with the relevant ones, which makes it hard for web users to locate the information they need.

Clustering the results retrieved by a search engine addresses this problem: the resulting groups should show a high degree of association between documents in the same group and a low degree between documents in different groups. Users can then browse only the groups that interest them, which saves them considerable time.

Building on a detailed study of feature extraction, feature weighting and document clustering algorithms, we use the TF-IDF algorithm to extract and weight features and the STC (suffix tree clustering) algorithm to explore the clustering of search engine results. The work focuses on the following points:

1) Based on an extended study of text preprocessing techniques, we implemented a multifunctional preprocessing component that covers acquisition of the retrieved results, page denoising, word segmentation and stop-word removal;

2) TF-IDF is the most commonly used weighting method in the VSM (vector space model). It effectively raises the weight of words that occur frequently within a single document and lowers the weight of words that occur frequently across the whole document collection. However, it ignores the influence of a word's POS (part of speech) and position, so we modified the TF-IDF formula by introducing a POS factor and a position factor. The experiments show that the modified formula improves the Macro_F1 and Micro_F1 of STC and enhances the performance of the clustering system;

3) After an in-depth study of STC, we designed a comparison between Lingo, K-means and STC. According to the experimental results, STC is superior to the other two algorithms in clustering and label induction. The labels generated by STC are more appropriate, and STC performs well both in representing the information in the retrieved results and in time complexity.

Detailed analysis of the experimental data shows that the clustering system is effective and that the expected goal is achieved.
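The modified weighting described in point 2 can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything specific here is an assumption: the factors are applied as simple multipliers on the classic TF-IDF weight, and the names and values `POS_FACTOR`, `TITLE_FACTOR` and `BODY_FACTOR` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical multipliers; the abstract does not state the real values.
POS_FACTOR = {"noun": 1.5, "verb": 1.2}  # any other POS tag defaults to 1.0
TITLE_FACTOR = 2.0                       # term occurs in the result title
BODY_FACTOR = 1.0                        # term occurs only in the snippet


def weighted_tfidf(docs):
    """Return one {term: weight} dict per document.

    Each document is a list of (term, pos, field) tuples, where field is
    "title" or "snippet". The weight is assumed to be
    tf * idf * pos_factor * position_factor.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update({term for term, _, _ in doc})

    weights = []
    for doc in docs:
        tf = Counter(term for term, _, _ in doc)

        # If a term occurs several times, keep its largest combined factor.
        factor = {}
        for term, pos, field in doc:
            position = TITLE_FACTOR if field == "title" else BODY_FACTOR
            combined = POS_FACTOR.get(pos, 1.0) * position
            factor[term] = max(factor.get(term, 0.0), combined)

        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term]) * factor[term]
            for term, count in tf.items()
        })
    return weights


docs = [
    [("cluster", "noun", "title"), ("search", "noun", "snippet"),
     ("fast", "adj", "snippet")],
    [("search", "noun", "snippet"), ("engine", "noun", "snippet")],
]
weights = weighted_tfidf(docs)
```

In this toy input, "search" occurs in every document, so its IDF and therefore its weight is zero, while the title noun "cluster" outweighs the snippet adjective "fast" even though both have the same raw TF-IDF value. That is the kind of effect a POS and position factor is meant to produce; the thesis itself may combine the factors differently.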

Keywords/Search Tags:

search engines, retrieved results, clustering, tf-idf, STC

Related items

1	Research And Improve On Clustering Method Of The Search Engine's Retrieved Results
2	Merging multiple search results approach for meta-search engines
3	The Research And Design On Personalized Search For Meta Search Engines
4	Students' success with World Wide Web search engines: Retrieving relevant results with respect to end-user relevance judgments
5	The Study On Web Search Results' Clustering
6	Research On Semantics-Based Search Results Clustering Methods
7	Chinese Search Results Clustering Research Based On Improved STC
8	Research On Search Results Clustering And Label Extraction
9	Research On The XML Pseudo Relevance Feedback Technology Based On Clustering Search Results
10	The Study On Web Search Results' Clustering