
Research On Web News Topic Detection And Tracking

Posted on: 2008-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Luo
Full Text: PDF
GTID: 2178360242972321
Subject: Signal and Information Processing
Abstract/Summary:
Topic Detection and Tracking (TDT) is an event-based information organization task that detects the appearance of new topics and tracks their reappearance and evolution. Its purpose is to organize information efficiently and help people find what they want easily. In recent years it has proved theoretically and practically valuable in military and other fields. This dissertation studies the models, algorithms and applications of several key research topics in TDT, including web crawling, web noise cleaning, and news topic detection and tracking. The major contributions of this dissertation are as follows:

Firstly, this dissertation designs and implements a general web crawler to meet the needs of the subsequent TDT stages. The crawler analyzes the Robots protocol, classifies web page styles, and parses news publication times. Experiments show that the crawler generalizes well, downloads web pages automatically, and provides sufficient support for the downstream information applications.

Secondly, combining knowledge of the noisy information embedded in web pages with the way web content is represented, a new algorithm for web noise cleaning based on the Vector Space Model (VSM) is presented. The approach divides the web content into blocks according to HTML tokens, picks out the topic content, and identifies web noise by comparing the similarity between the topic content and the remaining content. Experiments show that this algorithm outperforms traditional methods in the completeness and accuracy of web cleaning.

Thirdly, a topic detection method based on an adaptive centroid vector is proposed to avoid the shortcomings of current adaptive methods. The new method introduces named entities to represent topics and combines the preliminary topic centroid vector with each modified centroid vector for topic detection. Experiments show that the new algorithm lowers the miss and false alarm probabilities and improves the performance of the topic detection system.

Finally, considering the sparseness of positive examples, a modified KNN-based topic tracking method is introduced. The new method modifies the traditional KNN classifier for topic tracking and can lessen the side effect of densely populated negative examples. Furthermore, a time window is imposed to reduce the complexity of topic tracking. Experiments show that the improved algorithm overcomes the sparseness of the training set and enhances the stability of topic tracking.
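The VSM-based noise cleaning step in the second contribution can be illustrated with a minimal Python sketch. The abstract does not give the block segmentation rule, the term weighting, the way the topic block is selected, or the similarity threshold, so everything below is an assumption made for illustration: blocks arrive as plain strings, terms are weighted by raw frequency, the topic block is taken to be the block with the most tokens, and the names `term_vector`, `cosine`, `clean_blocks` and the default `threshold=0.1` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (a simple VSM representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def clean_blocks(blocks, threshold=0.1):
    """Split page blocks into (kept, noise) by similarity to the topic block.

    Assumption (not from the thesis): the topic block is the block with
    the most tokens, and a block is noise when its cosine similarity to
    the topic block falls below `threshold`.
    """
    if not blocks:
        return [], []
    vectors = [term_vector(b) for b in blocks]
    topic_index = max(range(len(blocks)), key=lambda i: sum(vectors[i].values()))
    topic = vectors[topic_index]
    kept, noise = [], []
    for i, block in enumerate(blocks):
        if i == topic_index or cosine(vectors[i], topic) >= threshold:
            kept.append(block)
        else:
            noise.append(block)
    return kept, noise


if __name__ == "__main__":
    page_blocks = [
        "Home | Login | Register | Contact us",
        "The earthquake struck the coastal city early on Monday and "
        "rescue teams reached the city within hours",
        "Rescue teams said the earthquake damaged hundreds of homes",
        "Copyright 2008 All rights reserved",
    ]
    kept, noise = clean_blocks(page_blocks)
    print("kept:", kept)
    print("noise:", noise)
```

In this toy page the navigation bar and the copyright line share no terms with the longest block, so they are classified as noise, while the second news paragraph is kept. A real system would likely use TF-IDF weights and a tuned threshold; those choices are outside what the abstract states.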
Keywords/Search Tags: Topic Detection and Tracking, Web Crawler, Vector Space Model, Named Entities, Topic Centroid Vector, K-Nearest Neighbor