
Research On The Technology Of The Web Employment Information Extraction Based On The HTML

Posted on: 2014-03-06; Degree: Master; Type: Thesis
Country: China; Candidate: H M Dai; Full Text: PDF
GTID: 2268330425956676; Subject: Computer application technology
Abstract/Summary:
With the increasing ubiquity of computers and the Internet, the Web has become an important channel through which people search for information. Because the Web is an enormous data source, retrieving information from it is currently one of the hot topics in information research.

College enrollment in China has been expanding every year, which puts considerable pressure on both student education and graduate employment. We hope to obtain a large amount of employment information from the Internet, which can provide guidance for specialty construction and student employment. Most of this mass of Web data is in the semi-structured HTML format. HTML text is not strictly structured and its semantics are not clear, so people cannot find the data they need quickly and accurately. How to obtain these data quickly and accurately is an urgent problem. This paper therefore presents a new model, based on HTML structure, for extracting information from Web employment pages. It is composed of an HTML structure preprocessing module, a table positioning module, and an information extraction module.

First, JTidy is used to clean the Web page code, which is converted into an XML document. Then the DOM tree of the Web page is built by parsing the XML. Finally, from a large number of observations, we derive heuristic rules for locating genuine tables, and the corresponding algorithms are designed and implemented. This paper also considers layouts in which cells span several rows or columns, which leave data cells misaligned with their corresponding attributes; such tables are normalized so that every row and every column has the same number of cells.

Experimental results on multiple Web sites show that the approach can extract employment information from the Web. It can be applied to extracting employment information from the Web and to further related studies, and it performs well.
Keywords/Search Tags: Information Extraction, HTML, DOM tree, Web table