
Study Of Clustering Engine Based On WWW

Posted on: 2004-05-15; Degree: Doctor; Type: Dissertation
Country: China; Candidate: W Zhang
GTID: 1118360095956607; Subject: Computer software and theory
Abstract/Summary:
With the rapid development and widespread adoption of the World Wide Web, the information resources available on the Web have grown enormously. As a result, current information retrieval technologies struggle to meet users' information needs quickly and accurately. The search engine is today the most commonly used tool for Web information retrieval, but its current state is still far from satisfactory. Finding new information retrieval technology has therefore become an important and difficult problem.

Data mining, also called KDD (Knowledge Discovery in Databases), aims at extracting hidden, previously unknown, useful, or unusual patterns and knowledge. Clustering is a basic form of data mining. By contrasting the similarities and dissimilarities in the data, clustering can reveal the data's inner characteristics and distribution rules and so lead to a deeper understanding of the data. In the era of information and digital media, Web data mining is becoming one of the hottest research topics.

Combining information retrieval technology with data mining technology may raise search engines to a new level. Applying Web data mining technologies in search engines is a novel solution that may lead to a new revolution in search engines. The study of a clustering engine based on the WWW is therefore both important and necessary.

After systematically reviewing the development of Web information retrieval, data mining, search engines, and clustering, this dissertation summarizes the existing problems in search engines and presents corresponding solutions. It focuses mainly on clustering Web search results in order to help users find relevant Web information more easily and quickly.

The main contributions and innovations of this dissertation are as follows:

(1) The current state of applied research on Web information retrieval, data mining, search engines, and clustering is summarized. We point out that the study of search engines based on the WWW is a crucial research subject.

(2) Rough set theory is studied in depth, the concept of an extended discernibility matrix is introduced, and ROUSTIDA (A Rough Set Theory based Incomplete Data Analysis Approach), an algorithm for analyzing incomplete data based on Rough set theory, is proposed. The advantage of this algorithm is that it uses only the information given by the operationalised data and does not rely on other model assumptions.

(3) The benefits of using key phrases as natural language information features are discussed. An effective method for key phrase extraction based on suffix arrays is presented, together with the algorithms find_ and combine__. The algorithm find_ discovers the right complete string, and combine__ finds the complete string of a document. We further analyze the presented algorithms and give an example to illustrate their correctness and effectiveness.

(4) The concept of the genetic algorithm, its configuration, its operators, and its existing problems are introduced. A new algorithm for clustering analysis based on the genetic algorithm is presented. Our approach has two characteristics. First, the algorithm is general-purpose, and our clustering analyzer can cluster large data sets with mixed numeric and categorical attributes. Second, it improves the efficiency of data mining and the quality of the knowledge obtained.

(5) A prototype search engine system based on data mining is designed and implemented. It groups Web search results in a semantic, online, tree-structured way, i.e. SOTC (Semantic Online Tree Clustering). It is also able to process Web information in Chinese.

(6) The dissertation concludes by summarizing the research and indicating directions for future work...
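
As a rough illustration of the suffix-array idea behind contribution (3), the sketch below finds word sequences that repeat in a text by sorting all word-level suffixes and comparing neighbours. It is not the dissertation's find_ or combine__ algorithm, whose details are not given in this abstract. The function names `lcp_len` and `repeated_phrases`, the whitespace tokenization, and the `min_words` threshold are all assumptions made for the example; Chinese text in particular would need a different tokenizer.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two word lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def repeated_phrases(text, min_words=2):
    """Return phrases of at least min_words words that occur more than once.

    Assumption: words are separated by whitespace. The suffix array is
    built naively by sorting suffixes, which is fine for short snippets
    but not for large collections.
    """
    words = text.lower().split()
    # Suffix array: start positions ordered by the suffix they begin.
    suffix_array = sorted(range(len(words)), key=lambda i: words[i:])
    phrases = set()
    # Suffixes that share a prefix end up adjacent after sorting, so the
    # common prefix of each neighbouring pair is a repeated phrase.
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        n = lcp_len(words[prev:], words[cur:])
        if n >= min_words:
            phrases.add(" ".join(words[cur:cur + n]))
    return sorted(phrases)


snippet = ("data mining helps search engine users and "
           "data mining improves search engine quality")
print(repeated_phrases(snippet))
```

Phrases found this way could serve as candidate cluster labels for search result snippets; how the dissertation filters and ranks such candidates is not stated here.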
Keywords/Search Tags:Web, information retrieval, data mining, search engine, clustering, Rough set theory