
Document Clustering In Search Engine

Posted on: 2010-10-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Zhou
Full Text: PDF
GTID: 1118360275486798
Subject: Computer software and theory
Abstract/Summary:
The amount of information on the Web has grown explosively with the development of the Internet. Search engines (SEs), as a foundational Internet application, have become the main tool people use to find the information they want on the Web. Current search engines widely employ data mining and machine learning to provide better search results for users. Clustering technology plays an important role in SEs and has attracted a great deal of interest from both industry and academia. Without any prior knowledge, clustering can partition a large number of documents into a small number of clusters according to document similarities, and the generated clusters help people understand the documents quickly.

There are two kinds of data in a search engine. One is the documents crawled from the Web, such as HTML Web pages, XML documents, and AJAX and Flash applications with few hyperlinks. The other is the query logs, which record the interactions between users and the search engine. Query log data provides a possible knowledge base for improving document clustering. This thesis studies document clustering technology and its application in search engines from three perspectives: document representation, document similarity definition, and clustering algorithms.

Most existing text clustering algorithms overlook the fact that a document is a word sequence rather than a set of discrete words. Frequent Itemset-based Clustering with Window (FICW) is proposed to exploit the semantic information carried by word positions for text clustering. FICW first mines frequent itemsets from the text collection under a window constraint on word sequences, and then applies a heuristic strategy to the itemsets to generate clusters. The experimental results show that FICW outperforms the compared method in both clustering accuracy and efficiency.

XML has been widely adopted as a standard for exchanging data on the Web, and measuring structural similarity among XML documents is the foundation of XML clustering. A structural similarity measure is proposed based on the Merge-Edit-Distance (MED). A merge tree is first constructed from the XML document trees being compared, and MED is then defined as the cost of the operation sequence that transforms the merge tree into the common tree. MED preserves the distribution information of the common sub-tree within the XML document trees. Experiments on real datasets show that the proposed similarity measure is effective. Another problem in XML similarity measurement is that people may use quite different tags to describe the same object. To address this problem, a novel similarity measure is proposed based on the data type tree. Since the types of data used to describe an object are not as changeable as the tags, the data type tree captures XML document similarities better. Experimental results show that with the data-type-tree similarity measure, clustering algorithms can correctly group semantically similar XML documents that use different tags but describe the same object.

The click information in query logs reflects which topics in a Web page users are interested in, so it is possible to cluster Web pages from the users' point of view with query log data. A Hybrid Vector Space Model (HVSM) is proposed for Web pages based on query logs. In HVSM, a virtual document is generated for a given Web page from its text content, based on topic keywords extracted from search click-through data.
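The abstract does not spell out how the virtual document is combined with the page content. As a minimal illustrative sketch of the idea (not the thesis's actual formulation), assuming click-through records are (query, clicked URL) pairs and using a hypothetical mixing weight alpha, an HVSM-style hybrid vector might be built as follows:

    from collections import Counter

    def topic_keywords(click_log, url):
        # Gather query terms from click-through records whose click landed on `url`.
        # `click_log` is assumed to be a list of (query, clicked_url) pairs; this
        # record format is an illustrative assumption, not the thesis's log schema.
        kw = Counter()
        for query, clicked_url in click_log:
            if clicked_url == url:
                kw.update(query.lower().split())
        return kw

    def hybrid_vector(page_text, click_log, url, alpha=0.5):
        # Blend the page's own term counts with the "virtual document" built from
        # click-through keywords; `alpha` is a hypothetical mixing weight.
        content = Counter(page_text.lower().split())
        virtual = topic_keywords(click_log, url)
        return {t: alpha * content[t] + (1 - alpha) * virtual[t]
                for t in set(content) | set(virtual)}

    # Toy usage with made-up data.
    log = [("xml clustering", "http://example.com/a"),
           ("document similarity", "http://example.com/a")]
    print(hybrid_vector("clustering of xml documents", log, "http://example.com/a"))

In this reading, terms that users actually query and click on are boosted relative to the rest of the page text, so the resulting vector reflects the page's topics as seen by users.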
The experimental results show that HVSM improves the quality of both Web page classification and Web page clustering.

Current search engines cannot rank weak-linked documents, such as PowerPoint files and AJAX applications, well; they therefore return either completely irrelevant results or poorly ranked documents when such files are searched for. RoC, a novel framework, is proposed for correctly retrieving and Ranking weak-linked documents based on Clustering. The experiments show that this approach considerably improves the result quality of current search engines and that of latent semantic indexing.
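The abstract gives no details of RoC's ranking step. One plausible, purely illustrative reading is that content clusters let a weak-linked document inherit ranking evidence from the well-linked documents grouped with it; the sketch below assumes hypothetical per-document text and link scores and a mixing weight beta, none of which come from the thesis:

    from collections import defaultdict

    def rank_with_clusters(docs, clusters, link_score, text_score, beta=0.5):
        # Rank documents by blending each document's own text relevance with the
        # average link-based score of its content cluster, so weak-linked documents
        # borrow evidence from well-linked documents clustered with them.
        # `clusters`, `link_score`, `text_score`, and `beta` are all assumptions
        # made for illustration, not the thesis's formulation.
        cluster_links = defaultdict(list)
        for d in docs:
            cluster_links[clusters[d]].append(link_score.get(d, 0.0))
        cluster_avg = {c: sum(v) / len(v) for c, v in cluster_links.items()}
        scores = {d: beta * text_score.get(d, 0.0)
                     + (1 - beta) * cluster_avg[clusters[d]]
                  for d in docs}
        return sorted(docs, key=lambda d: scores[d], reverse=True)

    # Toy example: "slides.ppt" has no in-links but shares a cluster with a
    # well-linked page, so it is no longer pushed to the bottom of the ranking.
    docs = ["page.html", "slides.ppt", "other.html"]
    clusters = {"page.html": 0, "slides.ppt": 0, "other.html": 1}
    link_score = {"page.html": 0.9, "slides.ppt": 0.0, "other.html": 0.4}
    text_score = {"page.html": 0.6, "slides.ppt": 0.7, "other.html": 0.5}
    print(rank_with_clusters(docs, clusters, link_score, text_score))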
Keywords/Search Tags: Search Engine, Web Usage Mining, Clustering Technology, XML Document Clustering, Weak-linked Document