
The Design And Implement Of Web Page Automatic Categorization And Storage Management System

Posted on: 2011-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Liu
GTID: 2178360308961574
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of network and information technology, the number of Web pages on the Internet has grown exponentially. How to organize and process this vast amount of information effectively, and how to search, filter, and manage these resources better, has become an urgent problem. Traditionally, web documents were classified manually, but manual classification is time-consuming and labor-intensive, which makes it impractical for the volume of documents on the Web. Automatic web page categorization was therefore proposed and has become one of the important technologies in information retrieval and information filtering. Through web page classification, we can build a web page category database that organizes and manages network resources effectively and improves the efficiency of information retrieval. Web page categorization can also be applied to information filtering: for example, the stored URL classifications can be used by a URL filtering system, and the web page classification model can be used for content filtering. Studying how to classify web pages efficiently and accurately, and how to preserve the classification results permanently, is therefore of great significance.

First, we briefly introduce the working principles, procedure, and development of web page categorization. Based on an analysis of the system requirements, we design the overall structure of the system. We then discuss in detail the techniques and methods used at each step of the workflow, mainly text representation models, Chinese word segmentation algorithms, and feature extraction algorithms, and we analyze and compare several commonly used feature extraction algorithms. To meet the requirement of permanently preserving the web page classification results, we propose incremental storage and feedback query strategies, which effectively save storage space. At the same time, the feedback query strategy compensates for the limitations of the web page collection. For the URL standardization required during storage and querying, we apply a new URL parsing method based on a nested FSM, which improves parsing efficiency and fault tolerance.

On the basis of this study of web page classification and storage technology, we present the design and implementation of a web page categorization and storage management system. We then test the key functions of the system, including information extraction, the feature extraction algorithms, the weight calculation algorithm, and the storage and query functions. The test results show that the system meets its design goals.
Keywords/Search Tags: Automatic Web Page Classification, Information Extraction, Word Segmentation, Feature Extraction, Incremental Storage, Feedback Queries
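
The abstract mentions URL standardization with a parser based on a nested FSM but gives no details of that design. Purely as an illustration of the general idea, and not the thesis's actual method, the sketch below uses a small outer state machine to split a URL into components and an inner one to split the authority into host and port. The function names, the state names, the default `http` scheme, and the choice to drop the fragment are all assumptions made for this example; user info (`user@host`) and IPv6 literals are not handled.

```python
DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumed defaults for this sketch


def parse_authority(authority):
    """Inner machine (hypothetical): split 'host[:port]' into (host, port)."""
    state, host, port = "HOST", [], []
    for ch in authority:
        if state == "HOST":
            if ch == ":":
                state = "PORT"
            else:
                host.append(ch.lower())  # host names are case-insensitive
        elif ch.isdigit():
            port.append(ch)
        # Any non-digit character in the port is skipped (fault tolerance).
    return "".join(host), "".join(port)


def normalize_url(url):
    """Outer machine (hypothetical): one pass over the URL, canonical output."""
    url = url.strip()
    # Scheme handling is done with a simple prefix check to keep the sketch short.
    head, sep, rest = url.partition("://")
    if sep and head.isalpha():
        scheme, url = head.lower(), rest
    else:
        scheme = "http"  # assumption: a missing scheme defaults to http

    parts = {"AUTHORITY": [], "PATH": [], "QUERY": [], "FRAGMENT": []}
    state = "AUTHORITY"
    for ch in url:
        if state == "AUTHORITY" and ch in "/?#":
            state = {"/": "PATH", "?": "QUERY", "#": "FRAGMENT"}[ch]
            if ch == "/":
                parts["PATH"].append(ch)  # the leading slash belongs to the path
        elif state == "PATH" and ch in "?#":
            state = "QUERY" if ch == "?" else "FRAGMENT"
        elif state == "QUERY" and ch == "#":
            state = "FRAGMENT"
        else:
            parts[state].append(ch)

    host, port = parse_authority("".join(parts["AUTHORITY"]))
    path = "".join(parts["PATH"]) or "/"
    query = "".join(parts["QUERY"])

    result = f"{scheme}://{host}"
    if port and port != DEFAULT_PORTS.get(scheme):
        result += f":{port}"
    result += path
    if query:
        result += f"?{query}"
    return result  # the fragment is dropped: it does not identify a separate page
```

With a normalizer of this kind, `HTTP://Example.COM:80/a/b?x=1#top` and `example.com/a/b?x=1` map to the same key, so a storage or query layer would treat them as one page rather than two.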