Design And Implementation Of Education News Webpage Information Extraction System

Posted on: 2013-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: L F Ren
GTID: 2248330395975397
Subject: Software engineering
Abstract/Summary:
The Internet is the world’s richest and densest source of information. In recent years, with the explosive growth of online information, web news pages have become a main channel through which people obtain information. How to find the pages a user needs in this flood of information has become a hot research topic in the field of information processing.
This paper focuses on the problem of information extraction from education news web pages. Based on this research, a web education news extraction system was designed and implemented. By acquiring the key information of news pages, it helps users get information quickly and easily.
The news pages are segmented into blocks based on their structure. First, each page is divided into fields according to the HTML tags <table> and <div> together with a few simple heuristic rules. Then, according to the features of each field, the page is divided into areas such as the navigation area, hyperlink area, footer area, non-display area, and main text area. After the areas that do not contain key news information are removed, the remaining content is passed to the next step.
Two methods are used for news information extraction. The first is based on heuristic rules, which were generated by statistical analysis of the features and structures of a large number of web news pages. This method is fast and accurate within a small range of web pages, but when pages with a new structure appear, the rules must be modified to match, so it is not very flexible. The second is based on the Hidden Markov Model (HMM). This method applies to web pages with different structures, but it requires sample pages to be labeled and used for training, so extraction takes longer than with the first method.
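The structure-based segmentation described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the class `BlockSegmenter`, the function names, the link-ratio and text-length thresholds, and the treatment of `<script>`/`<style>` as non-display content are all assumptions made for illustration. Only the use of <table> and <div> as block boundaries comes from the abstract.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}      # block boundaries named in the abstract
HIDDEN_TAGS = {"script", "style"}  # assumed stand-in for the "non-display area"


class BlockSegmenter(HTMLParser):
    """Collect the text (and the link text) found inside each <div>/<table>."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks: {"text": str, "link_text": str}
        self._stack = []   # blocks that are currently open
        self._in_link = 0  # depth of open <a> tags
        self._hidden = 0   # depth of open non-display tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"text": [], "link_text": []})
        elif tag == "a":
            self._in_link += 1
        elif tag in HIDDEN_TAGS:
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            block = self._stack.pop()
            text = " ".join(block["text"])
            if text:  # drop blocks with no visible text
                self.blocks.append(
                    {"text": text, "link_text": " ".join(block["link_text"])}
                )
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in HIDDEN_TAGS and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        data = data.strip()
        if not data or self._hidden or not self._stack:
            return
        # Text is credited to the innermost open block only.
        self._stack[-1]["text"].append(data)
        if self._in_link:
            self._stack[-1]["link_text"].append(data)


def classify(block, link_ratio=0.6, min_text_chars=80):
    """Hypothetical heuristic: link-heavy blocks are hyperlink areas."""
    text_len = len(block["text"])
    if len(block["link_text"]) / text_len >= link_ratio:
        return "hyperlinks"
    if text_len >= min_text_chars:
        return "main_text"
    return "other"


def main_text_blocks(html):
    """Return the text of every block classified as main text."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return [b["text"] for b in parser.blocks if classify(b) == "main_text"]
```

Calling `main_text_blocks(page_html)` on a page keeps only the blocks that the assumed rules treat as main text. A real rule set would need more features (position, tag attributes, punctuation density) to separate navigation and footer areas reliably.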
In this paper the two methods are combined: the first method is used to process and label the sample pages, and the HMM is then used to extract information.
Experimental results on a large number of education news web pages show that the methods used in our system (web page preprocessing, segmentation, and hybrid information extraction based on heuristic rules and HMM) are feasible, that the extraction accuracy and efficiency meet our actual needs, and that the system has practical value.
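As a rough illustration of the hybrid idea, the sketch below estimates HMM parameters by counting over sequences that are assumed to have been labeled by heuristic rules, then decodes a new page with the Viterbi algorithm. The state names, the observation symbols, the add-one smoothing, and the toy training sequences are all hypothetical. The abstract does not specify the model's states, features, or training procedure.

```python
import math
from collections import Counter

STATES = ["title", "date", "body", "other"]  # hypothetical field labels
SYMBOLS = ["short", "has_date", "long"]      # hypothetical block features


def train(sequences, smoothing=1.0):
    """Estimate log-probabilities by counting over (symbol, state) sequences."""
    start, trans, emit = Counter(), Counter(), Counter()
    for seq in sequences:
        start[seq[0][1]] += 1
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans[prev, cur] += 1
        for sym, state in seq:
            emit[state, sym] += 1

    def log_dist(counts, keys):
        # Additive smoothing so unseen events keep a non-zero probability.
        total = sum(counts[k] for k in keys) + smoothing * len(keys)
        return {k: math.log((counts[k] + smoothing) / total) for k in keys}

    log_start = log_dist(start, STATES)
    log_trans, log_emit = {}, {}
    for s in STATES:
        log_trans.update(log_dist(trans, [(s, t) for t in STATES]))
        log_emit.update(log_dist(emit, [(s, o) for o in SYMBOLS]))
    return log_start, log_trans, log_emit


def viterbi(observations, log_start, log_trans, log_emit):
    """Return the most probable state sequence for the observed symbols."""
    scores = {s: log_start[s] + log_emit[s, observations[0]] for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + log_trans[p, cur])
            new_scores[cur] = (
                scores[best_prev] + log_trans[best_prev, cur] + log_emit[cur, obs]
            )
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy sample pages, assumed to have been labeled by the heuristic rules.
labelled = [
    [("short", "title"), ("has_date", "date"), ("long", "body"),
     ("long", "body"), ("short", "other")],
    [("short", "title"), ("has_date", "date"), ("long", "body"),
     ("short", "other")],
]
params = train(labelled)
print(viterbi(["short", "has_date", "long", "long"], *params))
# → ['title', 'date', 'body', 'body']
```

Training by counting is possible here only because the heuristic rules supply a label for every block. Without labeled samples, the parameters would have to be estimated with an unsupervised procedure such as Baum-Welch instead.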
Keywords/Search Tags: Information extraction, Web page segmentation, Heuristic rules, Hidden Markov Model (HMM)