Research On Constructing Web Directory Based On Thematic Clustering

Posted on: 2011-04-08  Degree: Master  Type: Thesis
Country: China  Candidate: G D Yan  Full Text: PDF
GTID: 2178360308963910  Subject: Computer system architecture
Abstract/Summary:
The Internet is an important channel through which people obtain information. Statistics published by the China Internet Network Information Center (CNNIC) show that by December 31, 2009, the number of Internet users in China had reached 384 million and the number of web pages had reached 33.6 billion, having increased by more than 100% in a year. As web information grows rapidly, it is becoming more and more important to find an effective way to organize web text and help people get the information they want.

Web directories and search engines are the two most important ways for Internet users to get information. A search engine retrieves web pages by matching keywords, which normally returns many results that tend to be redundant. A web directory organizes web pages by topic, aiming to navigate users accurately, and sometimes meets users' needs more conveniently. At present, web directories are usually constructed manually: experts and volunteers work together on a cooperative network editing platform to edit and maintain web resources. This requires much human labor and makes it hard to keep up with the fast-growing number of web pages. In this paper, we use text clustering to construct a web directory.

Text clustering is an important technique in text analysis. It has been used in various settings such as automatic organization of documents, preprocessing for multi-document summarization, and cataloging of search results. Traditional text clustering methods have drawbacks when processing large, dynamically growing amounts of text. The high dimensionality of the feature space degrades performance, and feature words selected by frequency may lead to a mismatch between a word's importance and its weight, which results in inaccurate clustering. We propose a text clustering algorithm based on thematic words for web page clustering and use it to construct a web directory automatically. A few thematic words are selected as the features of each document, reducing the feature dimension and making use of semantic information. Because the number of clusters in web text is unknown, an improved CBC algorithm is proposed to determine the cluster centers adaptively and globally. Hierarchical clustering and incremental clustering are introduced to suit the characteristics of web directories and the ever-growing number of web pages.
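
The abstract does not give the details of the thematic word selection or of the improved CBC algorithm, so the following is only a minimal illustrative sketch of the general idea, not the thesis's method. It assumes that thematic words can be approximated by the top-k TF-IDF words of each document, and it swaps in a simple threshold-based incremental clustering (a new cluster is opened whenever no existing centroid is similar enough), so the number of clusters is not fixed in advance. The function names `thematic_words`, `cosine`, and `incremental_cluster`, the parameters `k` and `threshold`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def thematic_words(docs, k=3):
    """Assumption: approximate thematic words by each document's top-k TF-IDF words."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    features = []
    for doc in docs:
        if not doc:
            features.append({})
            continue
        tf = Counter(doc)
        scores = {w: (c / len(doc)) * (math.log(n / df[w]) + 1.0)
                  for w, c in tf.items()}
        # Sort by descending score, then alphabetically, so ties are deterministic.
        top = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
        features.append(dict(top))
    return features

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def incremental_cluster(features, threshold=0.3):
    """Threshold-based incremental clustering (a stand-in, not CBC)."""
    clusters = []  # each cluster: {"centroid": summed vector, "members": [doc ids]}
    for i, vec in enumerate(features):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            # Cosine ignores scale, so a running sum works as the centroid.
            for w, s in vec.items():
                best["centroid"][w] = best["centroid"].get(w, 0.0) + s
        else:
            clusters.append({"centroid": dict(vec), "members": [i]})
    return clusters

# Toy, already-tokenized pages (hypothetical data).
docs = [
    "python code compiler code".split(),
    "python compiler bug code".split(),
    "football match goal match".split(),
    "football goal league match".split(),
]
features = thematic_words(docs, k=3)
clusters = incremental_cluster(features, threshold=0.3)
print([c["members"] for c in clusters])
```

Because each document is reduced to at most `k` weighted words, the similarity computation stays cheap as the collection grows, and a newly arriving page can be assigned by calling the same loop body on its vector without re-clustering earlier pages.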
Keywords/Search Tags: Feature Extraction, Text Clustering, Thematic Word, Web Directory