
Research On Web Page Classification Algorithm Based On Statistics

Posted on: 2017-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q F Meng
GTID: 2308330482484189
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the development of computer and network information technology, and especially the wide adoption of the internet, information resources on the Web have grown rapidly, and the content of web pages has become increasingly complex and abundant. How to classify web pages systematically and obtain accurate information from this enormous volume of data has become an important research topic.
Web page classification is an important method for classifying internet information and for the big data field. It is mainly used to classify massive numbers of web pages so that people can conveniently collect the information they need. Traditional text classification technology is applied in many fields, mainly natural language processing, content filtering, and information collection. Text classification has matured steadily since the 1990s. It trains a model on a training corpus and then uses that model to assign categories to unknown documents.
Web pages present their information dynamically in semi-structured or structured form; however, more than 80% of that information consists of Chinese text. To classify web pages with a traditional text classification algorithm, a URL crawler must first fetch the pages as HTML-tagged web documents, the HTML tags must then be removed to produce text in a traditional plain format, and only then can the text be classified. Researchers have studied the component technologies of text classification in depth, for example word segmentation, feature extraction, and the classification algorithms themselves, and each of these is mature. However, the study of the feature words produced by feature extraction, and of how to weight them, is not yet deep enough.
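The tag-removal step described above can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes the page has already been fetched as an HTML string, uses only Python's standard `html.parser` module, and the names `TextExtractor` and `html_to_text` are hypothetical. Word segmentation of the resulting Chinese text is a separate step and is not shown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the content of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document as one plain string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>"
)
plain = html_to_text(page)
```

The plain string produced this way is what a conventional text classifier would receive as input.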
Based on the current state of text classification research, this thesis studies the combination of the information gain feature extraction algorithm with the ITC weighting algorithm for classification, and revises the existing ITC weighting algorithm to improve classification accuracy. In addition, the thesis implements systematic classification of web pages and applies the classification results to the URL filtering module of a firewall system, in order to ensure the accuracy of the firewall's URL filtering and to support firewall updates.
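As a rough illustration of information gain feature selection, the sketch below scores a term by how much knowing its presence or absence reduces the entropy of the class labels: IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]. This is the standard textbook definition, not necessarily the exact variant used in the thesis, and it does not cover the ITC weighting or its revision. The toy documents, labels, and function names are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Entropy reduction from splitting documents on presence of `term`.

    `docs` is a list of token sets (already segmented), aligned with `labels`.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Toy corpus (hypothetical): two classes, two documents each.
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"stock", "price"},
]
labels = ["sports", "sports", "finance", "finance"]

ig_goal = information_gain(docs, labels, "goal")    # separates the classes perfectly
ig_match = information_gain(docs, labels, "match")  # appears in only one document
```

Terms with higher scores are kept as features; in a pipeline of the kind the abstract describes, a weighting scheme would then be applied to the selected terms.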
Keywords/Search Tags: Webpage Classification, Crawler, ITC Weight Algorithm, Feature Extraction, URL Filter