Research On Noise Reduction And Duplicated Webpages Deletion Method For Accident News Corpus

Posted on: 2006-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Luo
Full Text: PDF
GTID: 2168360155956972
Subject: Computer application technology
Abstract/Summary:
For a news webpage, the content worth extracting is its topic, but the page also carries a large amount of noise alongside the topic content. Both are usually embedded in a single structure built with HTML. HTML is a presentation-oriented language, so it is very difficult to recover structural information about a webpage once it has been edited and published. At the same time, webpages contain abundant HTML tags, and accident news has characteristics of its own. Building on previous research, we therefore mine the webpage structure and make full use of the HTML tags. We study the extraction of the webpage title, body text, date of issue and similar fields from the editor's point of view.

Because websites reprint one another's articles, users often get redundant pages with the same content in web search results. This not only wastes storage resources but also greatly inconveniences information retrieval and other text processing. The main work of this thesis is to divide pages into groups according to date of issue, based on the fragility of accidental events, and to delete duplicated webpages by extracting information from specific areas of the page after noise reduction.

Building on the classical TF-IDF (Term Frequency-Inverse Document Frequency) method, we adopt mixed characteristic words to represent text...
Keywords/Search Tags:accidental event, news corpus, noise reduction, duplicated web pages removal, weight calculating
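
The grouping-then-deduplication idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's method: the abstract is truncated before the mixed characteristic word weighting is described, so plain TF-IDF with a smoothed IDF stands in for it. The function names, the whitespace tokenizer, the cosine-similarity comparison and the 0.9 threshold are all assumptions made for illustration, and the input texts are assumed to be already noise-reduced.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, so that terms
    appearing in every document still get a non-zero weight.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def remove_duplicates(pages, threshold=0.9):
    """pages: list of (date_issued, text). Return sorted indices of pages to keep.

    Pages are compared only against pages with the same date of issue;
    the threshold value is an assumption, not taken from the thesis.
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    docs = [text.lower().split() for _, text in pages]
    vectors = tfidf_vectors(docs)

    by_date = defaultdict(list)
    for index, (date_issued, _) in enumerate(pages):
        by_date[date_issued].append(index)

    kept = []
    for indices in by_date.values():
        group_kept = []
        for i in indices:
            # Keep a page only if it is not too similar to one already kept.
            if all(cosine(vectors[i], vectors[j]) < threshold for j in group_kept):
                group_kept.append(i)
        kept.extend(group_kept)
    return sorted(kept)


pages = [
    ("2005-03-01", "gas explosion at coal mine kills ten workers"),
    ("2005-03-01", "gas explosion at coal mine kills ten workers"),  # reprint
    ("2005-03-01", "highway bus collision injures five passengers"),
    ("2005-03-02", "gas explosion at coal mine kills ten workers"),  # other date
]
print(remove_duplicates(pages))  # [0, 2, 3]
```

Grouping by date first means each page is compared only with pages issued on the same day, which keeps the number of pairwise comparisons small. For Chinese news text, the whitespace split would need to be replaced by a word segmenter.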