
Research and Application of a Content-Based Clustering Method for Chinese Web Documents

Posted on: 2007-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2208360185955993
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, techniques for organizing and retrieving web pages have become an international research hot spot. Document clustering has received growing attention as a fundamental enabling tool for the efficient organization, navigation, retrieval, and summarization of large volumes of text documents. Its aim is to group documents into semantic classes based on their content, in an unsupervised manner. Combining document clustering with web search engines has become a hot spot in document mining, but little research has applied document clustering to Chinese web documents or integrated it with Chinese web search engine services.

To address this practical problem, this thesis, carried out as part of the project "The data mining service system based on Web application, MinerOnWeb", investigates the clustering of Chinese web documents in depth. A phrase-based document clustering method for Chinese web documents, one of the core technologies of the project, is a new method that overcomes the drawbacks of applying the traditional text representation model to Chinese documents. The traditional approach must handle high-dimensional vectors and depends on Chinese word segmentation. The thesis proposes a new clustering method that represents Chinese documents with a model named the Document Index Graph. Based on this model, the similarity between documents is calculated by finding matched phrases in them, which avoids both Chinese word segmentation and the handling of high-dimensional vectors. Subject-related documents can then be clustered together using an incremental clustering algorithm.

The new method has been implemented as the Chinese search engine results clustering subsystem of MinerOnWeb. MinerOnWeb is a web-based data mining service system that provides a variety of data mining services. With this subsystem, Chinese search engine results can be clustered by subject and displayed accordingly.
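The idea of matching phrases without word segmentation and then clustering incrementally can be illustrated with a minimal Python sketch. It is not the thesis's implementation and does not build a Document Index Graph: as a simplifying assumption it treats every contiguous run of 2 to 6 characters within a sentence as a candidate phrase and compares plain phrase sets. The function names, the length-weighted overlap score, and the 0.15 threshold are all hypothetical choices made for illustration.

```python
import re

# Assumption: sentences end at common Chinese/ASCII punctuation or whitespace.
SENTENCE_BREAK = re.compile(r"[。！？；，、,.!?;\s]+")


def phrases(text, max_len=6):
    """Collect every contiguous character run of length 2..max_len
    inside each sentence. No word segmentation is performed."""
    found = set()
    for sentence in SENTENCE_BREAK.split(text):
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                found.add(sentence[i:i + n])
    return found


def similarity(a, b):
    """Length-weighted overlap of two phrase sets, in [0, 1].
    Longer shared phrases contribute more than short ones."""
    if not a or not b:
        return 0.0
    shared = sum(len(p) for p in a & b)
    total = sum(len(p) for p in a | b)
    return shared / total


def incremental_cluster(docs, threshold=0.15):
    """Process documents one at a time: add each to the existing cluster
    with the highest average similarity if that average reaches the
    threshold, otherwise start a new cluster."""
    phrase_sets = [phrases(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, ps in enumerate(phrase_sets):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = sum(similarity(ps, phrase_sets[j]) for j in cluster) / len(cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append(i)
        else:
            clusters.append([i])
    return clusters


docs = [
    "搜索引擎结果聚类方法",
    "中文搜索引擎结果聚类",
    "今天天气很好适合散步",
    "今天天气很好我们去散步",
]
clusters = incremental_cluster(docs)
print(clusters)
```

In this toy input the first two strings share the long run "搜索引擎结果聚类" and the last two share "今天天气很好", so each pair lands in its own cluster. A graph-based index such as the one the thesis describes would presumably find the same shared phrases without enumerating every substring of every document.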
Keywords/Search Tags:Text clustering, Data mining, Search engine, J2EE