The Application And Study Of Clustering Analysis In Text Mining

Posted on: 2009-11-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Guo
GTID: 2178360272456545
Subject: Computer software and theory
Abstract/Summary:
In the real world, text is the most important carrier of information. Studies suggest that about 80% of information is contained in text documents. On the Internet in particular, text data exists in many forms, such as news, e-books, research papers, digital libraries, Web pages, and e-mail. People badly need tools to find resources and knowledge quickly and effectively. In recent years, data mining aimed at text has gradually become a new research topic, and text clustering has attracted extensive attention and achieved good results.

First, this thesis presents an in-depth theoretical study of text mining and clustering analysis, and reviews the domestic and international state of text mining research and its relationship to similar fields. We express and discuss in mathematical form the basic concepts of clustering analysis, such as data types, distance, and similarity coefficients. We then analyze five types of commonly used clustering algorithms and compare their performance.

Next, the thesis studies the text preprocessing process and its methods. We discuss how to transform unstructured text data into structured data. Because the quality of preprocessing directly affects the final results of text mining, we describe the text preprocessing process in detail, taking the characteristics of text mining into account.

Finally, we present a topic discovery system that aims to reveal the implicit knowledge present in news streams. This knowledge is expressed as a hierarchy of topics and subtopics, where each topic contains the set of documents related to it and a summary extracted from those documents. The summaries built in this way are useful for browsing the generated hierarchies and selecting topics of interest. Our proposal consists of a new incremental hierarchical clustering algorithm that combines partitional and agglomerative approaches, taking the main benefits of both. In addition, a new summarization method based on Testor Theory is used to build the topic summaries. Experimental results on the data collection demonstrate its usefulness and effectiveness not only as a topic detection system, but also as a classification and summarization tool.
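
The step of turning unstructured text into structured data, together with the similarity concepts mentioned in the abstract, can be illustrated with a minimal sketch. It assumes a bag-of-words TF-IDF representation and cosine similarity; the abstract does not state which representation or measure the thesis actually uses, and the function names (tokenize, tfidf_vectors, cosine) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} TF-IDF vector.

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1, so that
    every term present in a document gets a positive weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "stock markets fall",
    "stock markets rise",
    "football team wins match",
]
vectors = tfidf_vectors(docs)
similar = cosine(vectors[0], vectors[1])
dissimilar = cosine(vectors[0], vectors[2])
```

In this sketch the two finance headlines share terms and therefore score a higher similarity than the finance and sports pair, which share none. A clustering algorithm, whether partitional or agglomerative, would operate on such pairwise similarities or the corresponding distances.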
Keywords/Search Tags: text clustering, topic discovery, hierarchical method, cluster, text mining