Study Of Clustering Engine Based On WWW

Posted on:2004-05-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:W Zhang

Full Text:PDF

GTID:1118360095956607

Subject:Computer software and theory

Abstract/Summary:

With the rapid development and spread of the World Wide Web, the information resources available on the Web have grown enormously. As a result, current information retrieval technologies struggle to meet users' requirements for speed and validity. The search engine is today the most commonly used tool for Web information retrieval, but its current state is still far from satisfactory. Finding new information retrieval technology has therefore become an important and difficult problem.

Data mining, also called KDD (Knowledge Discovery in Databases), aims to extract hidden, previously unknown, useful, or unusual patterns and knowledge from data. Clustering is a basic form of data mining. By comparing the similarities and dissimilarities among data items, clustering reveals the inner characteristics and distribution of the data and so gives a deeper understanding of it. In the era of information and digital media, Web data mining is becoming one of the hottest research topics.

Combining information retrieval technology with data mining technology may raise search engines to a new level. Applying Web data mining technologies in search engines is a novel solution, and it may lead to a new revolution in search engines. The study of a clustering engine based on the WWW is therefore important and necessary.

After systematically reviewing the development of Web information retrieval, data mining, search engines and clustering, this dissertation summarizes the existing problems of search engines and presents corresponding solutions. It focuses mainly on clustering Web search results so that users can find relevant Web information more easily and quickly.

The main contributions and innovations of this dissertation are as follows:

(1) The current state of applied research on Web information retrieval, data mining, search engines and clustering is summarized, and the study of search engines based on the WWW is identified as a crucial research subject.

(2) Rough set theory is studied in depth, the concept of an extended discernibility matrix is introduced, and ROUSTIDA (A Rough Set Theory based Incomplete Data Analysis Approach), an algorithm for analyzing incomplete data based on Rough set theory, is proposed. The advantage of this algorithm is that it uses only the information given by the operationalised data and does not rely on other model assumptions.

(3) The benefits of using key phrases as natural language information features are discussed. An effective method for key phrase extraction based on suffix arrays is presented, together with the algorithms find_ and combine__. The find_ algorithm discovers the right complete string, and combine__ finds the complete string of a document. The presented algorithms are analyzed further, and an example is given to illustrate their correctness and effectiveness.

(4) The concept of the genetic algorithm, its configuration, its operators and its existing problems are introduced. A new algorithm for clustering analysis based on the genetic algorithm is presented. The approach has two characteristics. First, the algorithm is general-purpose, and the clustering analyzer can cluster large data sets with mixed numeric and categorical attributes. Second, it improves the efficiency of data mining and the quality of the knowledge obtained.

(5) A prototype search engine based on data mining is designed and implemented. It groups Web search results in a semantic, online, tree-structured way, i.e. SOTC (Semantic Online Tree Clustering). It can also process Web information in Chinese.

(6) The dissertation concludes by summarizing the research and indicating future directions...
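
Contribution (3) refers to key phrase extraction based on suffix arrays. The abstract does not describe the find_ and combine__ algorithms, so the following is only a minimal, generic sketch of the underlying idea, not the dissertation's method: build a suffix array over word tokens and read repeated phrases off the common prefixes of adjacent suffixes. The function names (`build_suffix_array`, `lcp_length`, `repeated_phrases`) and the parameters `min_words` and `min_count` are hypothetical, and the naive sort is chosen for clarity rather than speed. The sketch reports every repeated phrase and does not filter for complete strings.

```python
from typing import Dict, List, Tuple


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return suffix start positions, sorted by the token sequence each suffix spells.

    Naive construction (sorting full suffixes); fine for short snippets only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def lcp_length(tokens: List[str], i: int, j: int) -> int:
    """Number of leading tokens shared by the suffixes starting at i and j."""
    n = 0
    while i + n < len(tokens) and j + n < len(tokens) and tokens[i + n] == tokens[j + n]:
        n += 1
    return n


def repeated_phrases(
    tokens: List[str], min_words: int = 2, min_count: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Count phrases of at least min_words tokens that occur at least min_count times.

    Suffixes sharing a phrase as a prefix are contiguous in the suffix array,
    so a phrase occurring k times is shared by k - 1 adjacent suffix pairs.
    """
    sa = build_suffix_array(tokens)
    counts: Dict[Tuple[str, ...], int] = {}
    for a, b in zip(sa, sa[1:]):
        shared = lcp_length(tokens, a, b)
        for length in range(min_words, shared + 1):
            phrase = tuple(tokens[a:a + length])
            # Start at 1 so that k - 1 adjacent pairs yield a count of k.
            counts[phrase] = counts.get(phrase, 1) + 1
    return {p: c for p, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Hypothetical snippet text standing in for search-result titles.
    words = "data mining search engine data mining web data mining".split()
    print(repeated_phrases(words))
```

In a search-result clustering setting, phrases found this way could serve as candidate cluster labels; how the dissertation's SOTC prototype actually selects and ranks them is not stated in the abstract.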

Keywords/Search Tags:

Web, information retrieval, data mining, search engine, clustering, Rough set theory

Related items

1	Text Clustering And Its Application In Web Community Search Engine
2	Study On Rough Set Based Data Mining Methods
3	Intelligent Search Engine Based On Thematic Information Technology Research
4	Research On Rough Set For Application In Web Mining
5	A Study On The Application Of The Techniques Of Data Mining In Personalized Information Retrieval System
6	Search Engine Research And Design
7	Personalized Search Engine Based On User Interest Model, Research And Analysis
8	Research Of Text Mining Based On Rough Set Theory
9	A Study On Internet Information Retrieval And Developing Trend
10	Research And Improvement Of Web Structure Mining Algorithm