Hot Topic Detection From Forums Based On Clustering Analysis

Posted on: 2011-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z S Zhang
GTID: 2178330338981049
Subject: Computer Science and Technology
Abstract/Summary:
Topic detection is a subtask of TDT (Topic Detection and Tracking) that identifies topics by exploring and organizing the content of textual material and discovers new topics online. Within TDT, HTD (Hot Topic Detection) aims to identify the timely and important topics in a specific period of time. Traditional TDT techniques mainly address news text, and little research has targeted hot topic detection in forums.

In this thesis, we analyze how text is organized in forums. We also describe topic detection techniques in detail, especially text pre-processing methods and clustering algorithms. Building on this research, our main contributions are as follows.

1. We propose a DOM-tree-based algorithm for extracting post information from forums, drawing on research into web page information extraction methods and an analysis of forum post pages. We define how to generate the post DOM tree, the regular expression rules, and the extraction algorithm. Experiments show that the algorithm performs well in post information extraction and provides sufficient support for the subsequent steps.

2. We propose a feature extraction algorithm, BSDFS (BBS Short Document Feature Selection), derived from an analysis of how forums are organized and of the inherent characteristics of short documents. Experiments show that this algorithm achieves better results than traditional feature extraction methods.

3. We propose an incremental clustering algorithm with timeline analysis, informed by a broad study of clustering algorithms and of the requirements of hot topic detection. Unlike traditional clustering algorithms, it introduces a topic lift-support model into the clustering process, which gives topics liveness. Experiments show that the algorithm effectively improves the results of topic detection.

4. We propose a hotness calculation algorithm based on topic focus and user attention that can accurately evaluate the hotness of a topic. Experiments show that the algorithm produces a sound ranking of hot topics.

Based on the theoretical research above, we design and implement a prototype system with relation network graph visualization. The prototype integrates several hot topic detection methods and techniques. It has six modules: the database module, the forum crawler module, the post information extraction module, the pre-processing module, the topic detection module, and the hotness evaluation module. With this prototype system, we can retrieve the hot topics discussed in forums.
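
As a rough illustration of the incremental clustering idea in contribution 3, the sketch below shows generic single-pass clustering, a baseline often used in topic detection. It is not the algorithm proposed in the thesis: the abstract does not specify the timeline analysis or the topic lift-support model, so neither is modeled here. The function names, the term-frequency representation, and the `threshold` value are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(posts, threshold=0.3):
    """Assign each post, in arrival order, to the most similar topic so far.

    posts: iterable of token lists, assumed to be sorted by posting time.
    threshold: hypothetical similarity cutoff for joining an existing topic.
    Returns a list of topics, each with a centroid and member indices.
    """
    topics = []
    for index, tokens in enumerate(posts):
        vector = Counter(tokens)
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vector, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            # Summed term counts serve as an unnormalized centroid.
            best["centroid"].update(vector)
            best["members"].append(index)
        else:
            # No sufficiently similar topic exists, so start a new one.
            topics.append({"centroid": Counter(vector), "members": [index]})
    return topics


# Toy posts (already tokenized); the data is invented for illustration.
posts = [
    ["exam", "schedule", "final"],
    ["final", "exam", "room"],
    ["football", "match", "tonight"],
    ["match", "football", "score"],
]
topics = single_pass_cluster(posts)
print([topic["members"] for topic in topics])
```

On this toy input the posts fall into two topics, `[[0, 1], [2, 3]]`. Because each post is compared only against existing topic centroids, new posts can be processed as they arrive without re-clustering earlier ones, which is the property an incremental method relies on.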
Keywords/Search Tags:forums, information extraction, feature selection, clustering, hot topic detection