Research On Web Page Classification And Information Collection

Posted on:2018-01-29

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Wen

Full Text:PDF

GTID:2348330512482975

Subject:Information and Communication Engineering

Abstract/Summary:

A web page classification and information collection system comprises web crawling, web page type identification, and text collection. As the volume of web information grows rapidly, relying on the traditional approach of identifying page information manually is no longer practical. At the same time, web pages contain a large amount of noise, which makes collecting their main text more difficult. Existing text acquisition techniques suffer from high manual maintenance cost, low accuracy, and poor generality. Automatic web page identification and text collection have therefore become an important research direction; combined with information retrieval, search engines, online public opinion monitoring, and text recommendation, they make information easier to obtain. The main contents of this thesis are as follows:

(1) To meet the web crawler system's requirements for intelligence and prior information, an automatic method for identifying web page type based on mining Web structural features is proposed. The focus of this method is feature selection. Building on an understanding of web page feature mining, the thesis studies the different parts of different kinds of web pages and extracts a feature set that can represent a website. A classifier is then built with a classic classification algorithm (a decision tree) to identify Chinese web pages automatically.

(2) To automate text acquisition, a BBS web page text extraction method based on mining HTML tag features, namely text block extraction, is proposed. Its central idea rests on characteristics such as the tree structure of the document, the presence of multiple text centers, and the level of the tag elements. On this basis, a BBS web page text extraction method based on an intelligent template is proposed. The main idea is to use the tag-feature-based extraction method to find the information shared by the required multiple text blocks, then automatically configure a corresponding text parsing template for the site, and finally use the template to parse the page text.

(3) A web page classification and information collection system is built. The system consists of web crawling, page recognition, web page text extraction, and a UI. The crawling part uses common crawling techniques and processes, with the goal of searching the entire network. The page recognition part uses the feature-set-based automatic page type identification method proposed in this thesis. The text extraction part uses the smart-template-based BBS web page text extraction method.

Finally, the methods are tested in the system on real data. The experimental results show that they are feasible, highly accurate, versatile, and intelligent.
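The smart-template idea described in the abstract (find what the repeated text blocks of a BBS page have in common, store it as a per-site template, then parse with that template) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the thesis's actual algorithm: the names `BlockCollector`, `infer_template`, `extract_with_template` and `SAMPLE_PAGE` are hypothetical, a "template" is reduced to a `(tag, class)` pair, and the selection heuristic (the signature that repeats at least twice and carries the most text) is a simple stand-in for the tag-feature mining the thesis describes.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, plus elements whose text is not page content.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
IGNORED_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect (signature, text) pairs; a signature is (tag name, class attribute)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [signature, list of direct text parts]
        self.blocks = []  # finished (signature, text) pairs in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        signature = (tag, dict(attrs).get("class") or "")
        self.stack.append([signature, []])

    def handle_data(self, data):
        # Text is credited only to the innermost open element (its direct text).
        if self.stack and data.strip():
            self.stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if not any(sig[0] == tag for sig, _ in self.stack):
            return  # stray closing tag: ignore it
        while self.stack:
            signature, parts = self.stack.pop()
            text = " ".join(parts)
            if text and signature[0] not in IGNORED_TAGS:
                self.blocks.append((signature, text))
            if signature[0] == tag:
                break


def collect_blocks(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def infer_template(html, min_repeats=2):
    """Pick the repeated signature with the most text (illustrative heuristic)."""
    counts = defaultdict(int)
    lengths = defaultdict(int)
    for signature, text in collect_blocks(html):
        counts[signature] += 1
        lengths[signature] += len(text)
    candidates = [s for s in counts if counts[s] >= min_repeats]
    if not candidates:
        return None
    return max(candidates, key=lambda s: lengths[s])


def extract_with_template(html, template):
    """Return the text of every block whose signature matches the template."""
    return [text for signature, text in collect_blocks(html) if signature == template]


# Hypothetical BBS-style page: navigation, two posts, and a footer.
SAMPLE_PAGE = """
<html><body>
<div class="nav"><a href="/">Home</a> <a href="/board">Board</a></div>
<div class="post"><span class="author">alice</span>
  <div class="content">First post body with enough text to stand out.</div></div>
<div class="post"><span class="author">bob</span>
  <div class="content">Second reply, also a fairly long piece of text.</div></div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    template = infer_template(SAMPLE_PAGE)
    print(template)
    for post in extract_with_template(SAMPLE_PAGE, template):
        print(post)
```

In this sketch the template is inferred once from a sample page and could then be reused for other pages of the same site, which is the part of the approach that removes manual template maintenance. A real implementation would likely need richer signatures (for example the full tag path and nesting level) to cope with pages where posts and navigation share the same tag and class.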

Keywords/Search Tags:

web crawler, web page automatic identification, web page text extraction, smart template, machine learning

Related items

1	Research On The Intelligent Acquisition Of Web-Based News Contents
2	Research On Relation Extraction Of Person Entity In News Webpage
3	Research On Multi-page Special Web Page Text Extraction And Merging Technology
4	Based On Templated Web Crawler Technology Of Web Page Information Extraction
5	Research On Web-based Full-station Data Information Extraction Based On Template
6	Research On Focused Crawler Based On Page Segmentation
7	Research And Implementation On Key Technology Of Web Text Collection And Analysis
8	Research And Application Of Automatic Data Extraction From Template-generated Web Pages
9	Printed Documents Source Identification Using Geometric Distortion On Text Lines
10	Research On APK Crawler With Automatic Pagination Detection And Search Results Extraction