
Research On Chinese Webpages Classification Based On K-nearest Neighbour Algorithm And Relative Hyperlinks

Posted on: 2009-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Jin
GTID: 2178360272979838
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, online information is growing exponentially. Faced with cluttered web information, we need to classify it so that the desired, relevant information can be retrieved quickly. Automatic webpage classification is a key technology for handling large-scale webpage collections and an important method for organizing web information effectively. Improving the precision and recall of webpage classification is a goal that researchers continue to pursue.

Using Chinese webpage text extraction methods, we extract the body of a Chinese webpage, combining tag processing, noise removal, and text extraction. Links in a webpage are divided into two categories: links related to the theme of the page are called relative links, and links that have nothing to do with the theme of the page, such as navigation bars and advertising links, are called irrelevant links. This thesis proposes an algorithm that partitions a webpage's links into blocks to retrieve the relative links with good precision; the algorithm has low time complexity, and its precision and recall are satisfactory. Based on the vector space model, we select feature words by word frequency, use the KNN machine learning algorithm to classify Chinese webpages, and design and implement a Chinese webpage classification system. We compare the results of classification based on the title, on the body text, on relative links, on body text and relative links together, and on title and relative links together.
The results confirm that relative links help webpage classification, and we also propose a corresponding classification method. In open tests, the experimental data show that classifying on the webpage body and relative links together needs only a small training set and achieves an F1 value above 92 percent in every category, which is better than the traditional webpage classification approaches.
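
The classification step described in the abstract (term-frequency vectors in a vector space model, classified with KNN) can be illustrated with a minimal sketch. It is not the thesis's implementation. The function names, the use of cosine similarity, the similarity-weighted vote, and the value of k are assumptions made for illustration. The sketch also assumes each page has already been segmented into words, and that body text and relative-link text are combined by simply concatenating their token lists.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Build a term-frequency vector (word -> count) from a token list."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query_tokens, training, k=3):
    """Label a page by a similarity-weighted vote of its k nearest neighbours.

    training is a list of (tokens, label) pairs; tokens may hold body words,
    relative-link words, or both concatenated (an assumption of this sketch).
    """
    query = tf_vector(query_tokens)
    scored = sorted(
        ((cosine(query, tf_vector(tokens)), label) for tokens, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]
```

To classify a page on body text and relative links together under these assumptions, pass `body_tokens + link_tokens` as `query_tokens` and build the training pairs the same way.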
Keywords/Search Tags: Chinese webpages classification, webpage theme extraction, relative hyperlinks, K Nearest Neighbor, Vector Space Model