Research Of Chinese Page Automatic Classification Based On Representative Samples

Posted on:2011-05-19

Degree:Master

Type:Thesis

Country:China

Candidate:L P Ren

Full Text:PDF

GTID:2178360302499966

Subject:Software engineering

Abstract/Summary:

With the development of network technology, the number of web pages keeps growing, so pages need to be classified in order to deliver useful information to users directly, and information retrieval technology needs to improve accordingly. Search engines have become the most popular web service on the Internet, and automatic page classification is a useful tool for improving their performance. It is also a key step in Chinese information processing. Page classification automatically assigns web pages to topics, presenting users with a clear topic list and making it easier for them to find what they want. It has become a hot topic in the information retrieval field. Page classification reduces the number of pages a user has to browse and improves search efficiency. It also contributes to information resource management, subject dictionary construction, and search engine development.

This thesis first analyzes the background of the system and the current state of page classification in information management, summarizes the main problems to be solved, and briefly introduces the main techniques adopted by the system and its main contributions. It then introduces the principles, procedure, and related technologies of page classification, including page preprocessing, the vector space model, and feature extraction. Next, it discusses several popular text classification algorithms and their evaluation criteria, and summarizes the benefits of page classification for search engines. Comparing these algorithms shows that KNN (K-Nearest Neighbors) is the most suitable one for web page classification, and it is the focus of this work. The thesis studies the KNN algorithm in depth, analyzes its defects, and proposes an improvement, the representative sample algorithm, to make KNN more efficient. The system design strictly follows the Unified Modeling Language (UML).

The system is implemented on Windows XP using Visual C++ 6.0, together with HTML Parser and the relevant Boost packages. The classifier includes the following modules: storage and processing, reading, generation, representative sample calculation, candidate list generation, result list generation, classification result display, and result evaluation and testing. The system is designed to improve search efficiency, classify pages quickly and accurately, and help users quickly find the relevant information they need.
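
The idea of running KNN against a reduced set of representative samples in a vector space model can be illustrated with a minimal Python sketch. The abstract does not say how the representative samples are computed, so this sketch assumes one representative per class, taken as the normalized centroid of that class's term-frequency vectors. All function names are hypothetical, and the input is assumed to be already preprocessed and segmented into tokens; it is not the thesis's Visual C++ implementation.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """Map a token list to an L2-normalized term-frequency vector (a dict)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def representative_samples(training):
    """ASSUMPTION: one representative per class, the normalized centroid.

    `training` is a list of (label, tokens) pairs.
    """
    sums = defaultdict(Counter)
    for label, tokens in training:
        for term, weight in tf_vector(tokens).items():
            sums[label][term] += weight
    reps = []
    for label, vec in sums.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        reps.append((label, {t: w / norm for t, w in vec.items()}))
    return reps


def knn_classify(tokens, samples, k=1):
    """Majority vote among the k samples most similar to the query page."""
    query = tf_vector(tokens)
    ranked = sorted(samples, key=lambda s: cosine(query, s[1]), reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


training = [
    ("sports", ["football", "match", "goal"]),
    ("sports", ["match", "team", "goal"]),
    ("finance", ["stock", "market", "bank"]),
    ("finance", ["bank", "loan", "market"]),
]
reps = representative_samples(training)
label = knn_classify(["goal", "team"], reps, k=1)
```

Under this assumption the query is compared with one vector per class instead of every training page, which is where the efficiency gain over plain KNN would come from. A real system would likely also apply feature selection and term weighting before this step.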

Keywords/Search Tags:

Page classification, Vector Space Model, KNN Algorithm, Representative Samples

Related items

1	Research Of Chinese Page Automatic Classification Based On Vector Space Model
2	Research And Application Of Chinese Web Pages Automatic Classification
3	Research And Realization Of Term Selection In Chinese Web Page Classification Based On VSM
4	The Research And Implementation Of Web Page Classification In Enterprise Search Engine
5	Research And Implementation On A Web Page Classification System
6	Web Page Information Filtering Method Research Based On Vector Space Model
7	Optimizing Web Page Classification Algorithm By Using Hyperlinks
8	Research And Implementation Of Content Oriented Web Page Classification
9	Research On Improved KNN Chinese Web Page Classification Based On Weka Platform