The Study And Application Of Web Text Data Mining Technology Based On The Approximate Pages Clustering Algorithm

Posted on: 2006-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Yang
Full Text: PDF
GTID: 2168360155462626
Subject: Control theory and control engineering
Abstract/Summary:
With the rapid growth of information on the Internet, the development of data mining technology, and the emergence of XML, Web data mining has quickly become a focus of information retrieval research. This thesis systematically studies Web data mining, search engine technology, XML, and text mining, and introduces their characteristics, principles, methods, and current state of research. The Internet has become the main source of information, and people can retrieve relevant information from it with existing search engines. However, the returned information is usually massive and disordered, and users have difficulty finding what actually concerns them. The thesis therefore studies in depth how to apply Web text data mining to the documents returned by existing search engines in order to obtain the search patterns that users prefer. An algorithm for clustering Web pages, aimed at small texts, is proposed. The algorithm represents text features with the vector space model and clusters the vocabulary that interests the users (which users can initialize according to their needs) with a fuzzy clustering analysis method to obtain a knowledge pattern. When users search for information, duplicate pages are removed using MD5, and the remaining pages are clustered and ordered, so users can select the pages that interest them to explore. The "precision" of information searching is thereby greatly improved. To guarantee "recall", the algorithm can cluster pages retrieved from several search engine systems. Finally, in order to rank the page clusters that interest users more highly, and taking into account the validity of the users and their interests, a data mining algorithm for Web access sequences based on Markov chains is proposed and used to order the approximate page clusters. The algorithm is found to greatly improve searching efficiency while maintaining recall and precision. Because it targets small-text data mining, its time and space complexity is not high, so the algorithm can be expected to become a practical and effective information retrieval technology.
Based on the above ideas, an intelligent search engine system is designed. It runs quickly, attends to both "recall" and "precision", and has high searching efficiency. The system has been successfully used in TW-OA. Practice shows that the results of this thesis are practicable and valid.
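The abstract names three building blocks without giving their details: MD5-based removal of repeated pages, a vector space representation of text, and a Markov chain over Web access sequences. The following Python sketch illustrates each one in its simplest form. It is not the thesis's implementation: the function names, the whitespace tokenizer, the raw term-frequency weighting, the use of cosine similarity, and the first-order chain are all assumptions made for illustration, and the fuzzy clustering step itself is not shown.

```python
import hashlib
import math
from collections import Counter, defaultdict


def dedupe_by_md5(pages):
    """Keep the first page for each distinct MD5 digest of its text.

    Assumption: a "repeated" page is one whose text is byte-identical.
    """
    seen = set()
    unique = []
    for text in pages:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def term_vector(text):
    """Bag-of-words term-frequency vector (a minimal vector space model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities.

    Each session is a list of visited page (or cluster) labels in order.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical sample data, not taken from the thesis.
pages = ["web text mining", "web text mining", "fuzzy clustering of pages"]
unique_pages = dedupe_by_md5(pages)
similarity = cosine_similarity(
    term_vector("web text mining"), term_vector("web text clustering")
)
probabilities = transition_probabilities([["A", "B", "C"], ["A", "B"], ["A", "C"]])
```

In this sketch an exact digest match only catches identical text, so near-duplicate pages would need a normalization step before hashing. The estimated transition probabilities could then serve as one possible signal for ordering page clusters by what a user is likely to visit next.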
Keywords/Search Tags: Information searching, Web text data mining, Text clustering, Approximate pages clustering, Intelligent search engine system