Research Of Chinese Page Automatic Classification Based On Vector Space Model

Posted on: 2009-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Feng
Full Text: PDF
GTID: 2178360245999984
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of information on the Internet, information retrieval technology needs to improve. Automatic page classification is a useful tool for improving search engine performance and a key step in Chinese information processing. It automatically assigns web pages to topics, presenting users with a clear topic list and making it much easier for them to find what they want. Page classification has therefore become a hot topic in the information retrieval field.

This thesis first introduces the principles, procedure, and related techniques of page classification, including page preprocessing, the vector space model, and feature extraction. It then discusses several popular text classification algorithms and their evaluation metrics, and summarizes the benefits of page classification for search engines. A comparison of these algorithms finds that KNN (K-Nearest Neighbors) is the most suitable for web page classification, so it is the focus of this work.

The thesis studies the KNN algorithm in depth, analyzes its defects, and proposes an improvement, the representative sample algorithm, to make KNN more efficient. The improvement is validated experimentally on the Chinese page classifier. An analysis of page layout and structure finds that related links are associated with the page topic to a certain degree. On the basis of page blocks, a modified weighting method is proposed that uses page structure and link information to add weight to related links. A Chinese Page Classifier of KNN (CPCK) is designed and implemented to validate these improved algorithms.
Keywords/Search Tags:Page classification, Vector Space Model, Feature Selection, KNN Algorithm, Representative Samples
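
As a rough illustration of the baseline approach the abstract describes (documents represented in a vector space model and classified with KNN), the following minimal sketch may help. It is not the thesis's CPCK implementation: the function names, the TF-IDF weighting formula, the cosine similarity measure, the similarity-weighted vote, and the toy training data are all assumptions made for this example. It also assumes the page text has already been extracted and segmented into words, and it does not attempt the representative sample or link weighting improvements, whose details the abstract does not give.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for tokenized documents.

    The smoothed IDF used here is an assumption for this sketch,
    not a formula taken from the thesis.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [vectorize(doc, idf) for doc in docs]
    return vectors, idf


def vectorize(tokens, idf):
    """Map a token list to a TF-IDF vector, ignoring terms unseen in training."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (f / total) * idf[t] for t, f in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the label with the largest summed similarity among the k nearest neighbors."""
    scored = [(cosine(query, vec), label)
              for vec, label in zip(train_vectors, train_labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical, already-segmented training pages and their topics.
train_docs = [
    ["足球", "比赛", "球员"],
    ["篮球", "比赛", "球队"],
    ["股票", "市场", "投资"],
    ["银行", "利率", "投资"],
]
train_labels = ["sports", "sports", "finance", "finance"]

train_vectors, idf = tfidf_vectors(train_docs)
query = vectorize(["球员", "比赛", "进球"], idf)
print(knn_classify(query, train_vectors, train_labels, k=3))
```

Because plain KNN compares each query page against every training vector, classification cost grows with the size of the training set. That cost is the kind of inefficiency a reduced set of representative samples would be expected to address.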