Research On Web Page Classification Algorithm Based On Statistics

Posted on:2017-02-01

Degree:Master

Type:Thesis

Country:China

Candidate:Q F Meng

Full Text:PDF

GTID:2308330482484189

Subject:Electronics and Communications Engineering

Abstract/Summary:

With the development of computer and network information technology, and especially the wide adoption of the Internet, information resources on the Web have grown rapidly, and web page content has become increasingly complex and abundant. How to classify web pages systematically and extract accurate information from this enormous volume of data has become an important research topic. Web page classification is an important method in Internet information classification and in the big data field; it is mainly used to categorize massive numbers of web pages so that people can more easily collect the information they need.

Traditional text classification technology is applied in many fields, chiefly natural language processing, content filtering, and information gathering. It has matured steadily since the 1990s. It works by building a model from a training corpus and then using that model to assign unknown documents to categories.

Web pages present their information dynamically, in semi-structured or structured form, yet more than 80% of that information consists of Chinese text. To classify web pages with a traditional text classification algorithm, a crawler must first fetch the pages by URL to obtain the HTML-tagged documents; the HTML tags are then removed to produce plain text, which is classified as ordinary text. Since the field emerged, researchers have studied text classification in depth, and its component techniques (word segmentation, feature extraction, and classification) are each mature. However, how to weight the feature words produced by feature extraction has not been studied deeply enough.
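
The tag-removal step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the page has already been fetched as a string, uses only Python's standard `html.parser` module, and the names `html_to_text` and `_TextExtractor` are hypothetical. Joining text fragments with spaces is a simplification that may insert spaces between adjacent inline elements.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the plain text of an HTML document (illustrative only)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting plain text would then go to word segmentation and feature extraction; for Chinese pages a dedicated segmenter is needed, which this sketch does not cover.
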
Based on the current state of text classification research, this thesis studies the combination of the information gain feature extraction algorithm with the ITC weighting algorithm for classification, and revises the existing ITC weighting algorithm to improve classification accuracy. In addition, the thesis implements systematic classification of web pages and applies the classification results to the URL filtering module of a firewall system, in order to ensure the accuracy of the firewall's URL filtering and to support firewall updates.
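
As a rough illustration of information gain feature selection, the sketch below scores a term by how much knowing whether it occurs reduces the entropy of the class labels: IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]. This is the textbook definition, not the thesis's code. It assumes documents are already segmented into tokens, and the function names are hypothetical. The ITC weighting step is not shown because the abstract does not give its formula.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain of `term` presence with respect to the class labels.

    docs   -- list of token collections, one per document (assumed input shape)
    labels -- list of class labels, parallel to docs
    """
    if not docs or len(docs) != len(labels):
        raise ValueError("docs and labels must be non-empty and the same length")

    with_term, without_term = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            with_term[label] += 1
        else:
            without_term[label] += 1

    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with

    h_class = entropy(list(Counter(labels).values()))
    h_conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return h_class - h_conditional
```

In a pipeline of this kind, every vocabulary term would be scored this way and only the highest-scoring terms kept as features before weighting.
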

Keywords/Search Tags:

Webpage Classification, Crawler, ITC Weight Algorithm, Feature Extraction, URL Filter

Related items

1	Design And Implementation Of Webpage Tampering Monitoring System
2	Webpage Content Extraction Techniques For Specific Topic
3	The Design And Implementation Of Distributed Web Crawler System Based On Automatic Extraction Of Webpage Information
4	Design And Implementation Of Keywords-based Microblog Crawler System
5	Webpage Text Extraction And Bilingual Website Detection Based On Multi-feature Fusion
6	Using Webpage's Features Recognize Porn Webpage
7	Research On WEB Entity Information Extraction Algorithm And Its Application
8	Active Intelligent Detection Of False Webpages Based On Web Crawler
9	Research On Webpage Text Extraction And Management Based On Internet Information Retrieval
10	Study On Webpage Mount Detection Technology