
The Research Of Web Text Clustering Based On Ontology

Posted on: 2012-03-24  Degree: Master  Type: Thesis
Country: China  Candidate: Y Zhao  Full Text: PDF
GTID: 2178330335953157  Subject: Computer application technology
Abstract/Summary:
At the first annual ORG forum, Beckstrom pointed out that 25% of the world's population, conservatively estimated at 1.75 billion people, now uses the Internet by computer[1]. According to the speech "On China's Internet Development and Management" by Chen Wang, minister of the State Council Information Office, the number of Web pages in China had reached 33.6 billion by 2010, of which 87.8% are in text form[2]. The Internet is clearly permeating every aspect of people's study, work, and leisure at remarkable speed. Text mining, as an important way to extract useful information from the Web, therefore remains a research hotspot in information retrieval.
Clustering technology is widely used in many areas of information retrieval, where a number of mature algorithms play an important role. For a query that is vague, ambiguous, or related to multiple topics, a traditional search engine returns a long list of results in which many different themes are scattered, so users must spend considerable time and effort to find the Web pages that meet their needs. This both lowers the quality of the search results and greatly reduces user satisfaction. Researchers have proposed a number of methods to address this problem. Among them, search results clustering offers an effective solution that has been applied and steadily developed in practice. Commercial search engines such as Vivisimo[3] and Infonetware RealTerm Search[4] are successful examples of search results clustering technology.
The input of a search results clustering system is usually the set of search results returned by a traditional search engine in response to a user's query. Each search result consists of a title, a text summary, and a hyperlink. The output is a set of labeled clusters obtained by grouping the search results.
The cluster labels in the output are therefore the key means by which a search results clustering engine interacts with the user. Search results clustering must not only cluster the documents in the result set but also give users cluster labels that are easy to understand and clearly distinguish one cluster from another, whereas traditional clustering techniques simply take the cluster center as the cluster label.
This paper presents an improved method for clustering search results, focused on Chinese Web pages. Search results clustering differs from traditional clustering: the former gives great weight to cluster label extraction, while the latter focuses on similarity calculation and cluster structure. Three indicators are important for the quality of search results clustering: cluster label semantic integrity, cluster label readability, and cluster content relevance. Good cluster labels effectively distinguish the relationships and hierarchy among clusters, express the theme of each cluster intuitively, and guide users quickly to the information they seek. This paper proposes an ontology-based analysis method for improving the quality of cluster labels.
Using the concept hierarchy of HowNet, our method tags clusters with conceptualized cluster labels to improve the cluster label readability (CLR) and cluster content relevance (CCR) evaluation measures.
The contributions of the dissertation are summarized as follows:
(1) A method is presented that identifies phrases carrying complete semantic information by comparing the attributes of base clusters in the suffix tree document model and the overlap of their document sets.
(2) To better reflect the degree of association between terms, a novel method is proposed that computes the distance between co-occurring terms at sentence granularity.
(3) The concept of base cluster contribution is proposed: each base cluster is assigned a weight according to the number of words it contains and their parts of speech, in order to determine whether it is eligible to become a candidate cluster label.
(4) Candidate cluster labels are matched against concepts in the HowNet concept hierarchy, so that conceptual cluster labels are extracted and label representation is raised from the plain-text level to the semantic level.
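The idea behind contribution (1), comparing base clusters by the overlap of their document sets, can be illustrated with a minimal sketch. It is not the thesis's algorithm: the function names, the 0.8 threshold, the whitespace tokenization (Chinese text would first need word segmentation), and the rule "drop a phrase when a longer phrase contains it and covers almost the same documents" are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def coverage(short_docs: Set[int], long_docs: Set[int]) -> float:
    """Fraction of the shorter phrase's documents that also contain the longer phrase."""
    if not short_docs:
        return 0.0
    return len(short_docs & long_docs) / len(short_docs)


def is_subphrase(short: List[str], long: List[str]) -> bool:
    """True if `short` occurs as a contiguous run of words strictly inside `long`."""
    n, m = len(short), len(long)
    if n >= m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))


def complete_phrases(base_clusters: Dict[str, Set[int]],
                     threshold: float = 0.8) -> List[str]:
    """Keep phrases that are not subsumed by a longer phrase.

    `base_clusters` maps a phrase (as found in a suffix tree node) to the
    set of document ids containing it. The threshold is a hypothetical value.
    """
    words = {phrase: phrase.split() for phrase in base_clusters}
    kept = []
    for phrase, tokens in words.items():
        subsumed = any(
            is_subphrase(tokens, other_tokens)
            and coverage(base_clusters[phrase], base_clusters[other]) >= threshold
            for other, other_tokens in words.items()
        )
        if not subsumed:
            kept.append(phrase)
    return kept


# Toy base clusters: phrase -> ids of the search results containing it.
example = {
    "search results": {1, 2, 3, 4},
    "search results clustering": {1, 2, 3, 4},
    "results": {1, 2, 3, 4, 5, 6, 7, 8},
    "suffix tree": {5, 6},
}
print(complete_phrases(example))
```

In this toy input, "search results" is dropped because the longer phrase "search results clustering" appears in all of the same documents, while "results" is kept because half of its documents do not contain any longer phrase.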
Keywords/Search Tags: search results clustering, suffix tree, association calculation, ontology, cluster labels