
Research And Implementation Of Method For Web Noise Elimination And Feature Selection

Posted on: 2011-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Su
Full Text: PDF
GTID: 2178360305961486
Subject: Computer application technology
Abstract/Summary:
Web page classification can, to a large extent, solve the problem of disordered information. Because web pages contain a great deal of noise, and because feature selection affects classification, reducing web noise and improving feature selection are vitally important to web page classification, which has become a research hotspot.

First, the STU-DOM algorithm cannot reliably extract page content that contains no hyperlinks, or content inside DIV tags, so its HTML noise elimination results are unsatisfactory. This thesis extends STU-DOM to take into account body text held in both TABLE and DIV tags. Content is extracted using two measures: the frequency of word co-occurrence between the page title and a node, and text similarity. The co-occurrence frequency between the title and the node is calculated first, and the node is preserved if the frequency exceeds a given threshold. Otherwise, the similarity between the TABLE or DIV content already extracted and the candidate node is calculated, and if the similarity exceeds a given threshold, the node is extracted as part of the page body. A detection step follows, and its result decides whether to continue extracting from the current TABLE or DIV tag in the page.

Second, the Relative Frequency Difference (RFD) algorithm assigns a higher value to terms that cannot differentiate web page categories. A further improvement considers the absolute value of the sum of a term's representation and its identification. Classifier experiments verify that the improved algorithm achieves better classification performance.

Finally, a crawler based on the open source bot.jar package is extended to calculate the similarity between each URL to be crawled and the topic. If the URL meets the relevance threshold, it is added to the waiting queue. The crawler also calculates the similarity between the crawled page content and the sports category feature vector, and if the page meets the similarity threshold, it is saved to local disk. The crawler implemented in this thesis can download sports-themed pages. A sports corpus was created, and classifier training on its test sets was used to determine the optimal threshold for downloading topic pages. This thesis implements web page noise elimination and feature extraction, and classification experiments indicate the effectiveness of the algorithms.
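
The two-stage extraction rule and the crawler's threshold checks described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names, the whitespace-style tokenizer, the definition of co-occurrence frequency as the share of node words that also appear in the title, the use of cosine similarity over word counts, and the threshold values.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cooccurrence_frequency(title, node_text):
    """Share of node tokens that also occur in the title (assumed definition)."""
    title_words = set(tokenize(title))
    node_words = tokenize(node_text)
    if not node_words:
        return 0.0
    shared = sum(1 for word in node_words if word in title_words)
    return shared / len(node_words)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors (assumed similarity measure)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_node(title, node_text, extracted_text,
              cooc_threshold=0.2, sim_threshold=0.3):
    """Decide whether a TABLE or DIV node is kept as body text.

    Stage 1: keep the node if its title co-occurrence exceeds cooc_threshold.
    Stage 2: otherwise keep it if it is similar enough to text already extracted.
    Both thresholds are hypothetical placeholder values.
    """
    if cooccurrence_frequency(title, node_text) > cooc_threshold:
        return True
    return cosine_similarity(extracted_text, node_text) > sim_threshold


def is_on_topic(candidate_text, topic_text, threshold=0.3):
    """Crawler-style gate: candidate (a URL or page text) meets the topic threshold."""
    return cosine_similarity(candidate_text, topic_text) >= threshold
```

In this sketch the same `is_on_topic` check would be applied twice by a crawler: once to the URL before it is queued, and once to the downloaded page content before it is saved. A real system would more likely use weighted feature vectors and a proper word segmenter rather than raw word counts.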
Keywords/Search Tags:Web Page Elimination, STU-DOM, Feature Selection, Relative Frequency Difference, Crawler, Word Co-occurrence