Text Clustering And Its Application In Web Community Search Engine

Posted on: 2007-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: W H Liu
Full Text: PDF
GTID: 2178360185454133
Subject: Computer software and theory
Abstract/Summary:
Search engine services have become increasingly important in information retrieval because of the development of the World Wide Web and the growth of data. Surveys of search engine user behavior show that a search engine should help users construct new queries as well as rank results properly. Organizations tend to join a community to collaborate on the same task, which gives rise to the web community. A search engine built for the community can help its users find information. One main contribution of this thesis is an effective search engine that serves a web community using text clustering techniques; the other is an evaluation of the performance of clustering algorithms using clustering validation techniques.

Based on a study of search engine systems, we conclude that a results-ranking scheme is not feasible in some situations. Given the self-organization of web community content, text clustering techniques can be adopted to improve performance. Reorganizing search results makes them more browsable and can also assist users in constructing new queries.

The principles of text clustering are presented. After an in-depth analysis of several clustering algorithms, including hierarchical clustering, k-Means, Ant-based clustering, and Suffix Tree Clustering (STC), we evaluated the performance of three of them (k-Means, Ant-based clustering, and STC) using clustering validation techniques. Using the Reuters-21578 test collection, external criteria, and carefully designed experiments, we conclude that STC outperforms both k-Means and Ant-based clustering. This is mainly because of the linguistic feature that STC exploits: phrases.

We constructed a search engine for the Chinalab web community with web crawling, indexing, and searching components, using the open-source Lucene search component for indexing and searching. The search engine can group results according to their position. Performance analysis of the system shows that it helps web users find information in the community. Search engine results are then reorganized using text clustering. Evaluation shows that search results clustering presents results through a different user interface and improves the performance of the web community search engine.
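The abstract does not say which external criteria were used in the evaluation. As an illustration only, the sketch below computes purity, one common external criterion that compares cluster assignments against reference class labels such as the topic labels of a test collection. The function name `purity` and the toy data are hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Group the reference labels by the cluster each document was assigned to.
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_by_cluster[cluster].append(label)

    # For each cluster, count the documents carrying its most frequent label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Toy data (hypothetical): six documents, two clusters, topic-style labels.
clusters = [0, 0, 0, 1, 1, 1]
topics = ["grain", "grain", "trade", "trade", "trade", "grain"]
score = purity(clusters, topics)
```

A purity of 1.0 means every cluster contains documents of a single class. Because purity rises trivially as the number of clusters grows, it is usually reported together with other external measures such as entropy or F-measure.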
Keywords/Search Tags: Text Clustering, Search Engine, Information Retrieval, VSM, Clustering Validity, Text Mining, Web Mining, Suffix Tree Clustering