Research on Automatic Classification of Chinese Web Pages Based on Representative Samples

Posted on: 2011-05-19  Degree: Master  Type: Thesis
Country: China  Candidate: L P Ren  Full Text: PDF
GTID: 2178360302499966  Subject: Software engineering
Abstract/Summary:
With the development of network technology, the number of web pages keeps growing. Web page classification is therefore needed to deliver directly useful information to users, and information retrieval technology has to improve accordingly. Search engines have become the most popular web service on the Internet, and automatic page classification is a useful tool for improving their performance. It is also a key step in Chinese information processing. Page classification automatically assigns web pages to topics, presents users with a clear topic list, and makes it much easier for them to find what they want. Research on page classification has become a hotspot in the information retrieval field. Page classification reduces the number of pages a user has to browse and improves the efficiency of search engines. It also contributes to information resource management, subject dictionary construction, and search engine development.
In this thesis, the author first analyzes the development background of the system and the current state of information management for page classification, summarizes the main problems to be solved, and briefly introduces the main techniques adopted by the system and its main contributions. The thesis then introduces the principles, procedure, and related technologies of page classification, including page preprocessing, the vector space model, and feature extraction. Next, it discusses several popular text classification algorithms and their evaluation criteria, and finally summarizes the benefits of page classification for search engines. A comparison of these algorithms shows that KNN (K-Nearest Neighbors) is the most suitable one for web page classification, and it is the focus of this work. The thesis studies the KNN algorithm in depth, analyzes its defects, and proposes an improved method, the representative sample algorithm, to make KNN more efficient. The system design strictly follows the Unified Modeling Language (UML).
The system is implemented on the Windows XP operating system using the Visual C++ 6.0 development tool, combined with HTML Parser and the relevant Boost packages. The classifier consists of a storage and processing module, a reading module, a conversion module, a representative sample calculation module, a candidate list generation module, a result list generation module, a classification result display module, and a result evaluation and testing module, among others. The system improves search efficiency, classifies pages quickly and accurately, and helps users quickly find the relevant information they need.
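The abstract does not say how the representative samples are computed, so the following is only a minimal Python sketch of the general idea, not the thesis's algorithm. It assumes that each class is reduced to a single representative (the normalised centroid of its training vectors in the vector space model), that pages are already tokenised (Chinese text would need word segmentation first), and that plain term frequency is used as the weight. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """Return an L2-normalised term-frequency vector as a sparse dict."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def build_representatives(training):
    """Assumption: one representative per class, the normalised centroid."""
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in training:
        for term, weight in tf_vector(tokens).items():
            sums[label][term] += weight
    reps = []
    for label, vec in sums.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        reps.append(({t: w / norm for t, w in vec.items()}, label))
    return reps


def knn_classify(tokens, samples, k=1):
    """Similarity-weighted vote among the k samples closest to the query."""
    query = tf_vector(tokens)
    scored = [(cosine(query, vec), label) for vec, label in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for score, label in scored[:k]:
        votes[label] += score
    return max(votes, key=votes.get)


# Toy, already-tokenised training pages (hypothetical data).
training = [
    (["football", "match", "goal"], "sports"),
    (["team", "goal", "league"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
representatives = build_representatives(training)
print(knn_classify(["goal", "league", "team"], representatives))
```

Plain KNN compares a query page against every training page. Comparing only against a small set of representatives cuts the number of similarity computations from the size of the training set to the number of representatives, which is the kind of efficiency gain the abstract attributes to the improved method.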
Keywords/Search Tags: Page classification, Vector Space Model, KNN Algorithm, Representative Samples