Research On Semi-structure Information Extraction For Web

Posted on: 2010-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Zhou
Full Text: PDF
GTID: 2178360272979365
Subject: Computer application technology
Abstract/Summary:
With the rapid development and spread of the Internet, more and more people obtain information from the Web. Finding the information one needs quickly and efficiently has become a serious problem, and Web information extraction technology emerged to address it. Many approaches to wrapper generation have been proposed, but each has limitations that keep the resulting wrappers from being accurate, robust, or general. Building better wrappers has therefore become a focus of information extraction research.

After analyzing XML and information extraction technologies, this thesis develops an XML-based Web information extraction system. With this system, users can extract the information they are interested in from HTML pages, and the extraction results are expressed in XML, which is well structured and extensible. The system is general and flexible: users can quickly customize Web information extraction wrappers for different domains. Exploiting the way XPath locates data regions in a page, a DOM-based XPath algorithm is implemented. XSLT is used as the description language for extraction rules, and XPath is used to locate the information to be extracted.

The Web information extraction method presented in this thesis addresses the problem more effectively, and the system achieves relatively high precision and recall.
Keywords/Search Tags:data mining, information extraction, semi-structured data, Web
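
The core idea of the abstract (locate data in the page tree with XPath, then emit the results as XML) can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis describes its rules in XSLT, whereas this sketch swaps in a plain Python dictionary of XPath expressions and uses the limited XPath subset of the standard-library `xml.etree.ElementTree` module. The sample page, the rule set, and the names `PAGE`, `RULES`, and `extract` are all hypothetical, and the page is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; assumed to be well-formed XHTML so that the
# standard-library XML parser can build the tree.
PAGE = """<html><body>
<div id="nav">Home</div>
<table id="books">
  <tr><td>XML Basics</td><td>12.50</td></tr>
  <tr><td>Web Mining</td><td>30.00</td></tr>
</table>
</body></html>"""

# Hypothetical extraction rules: one XPath selects each repeating record,
# and one relative XPath per field selects a value inside that record.
RULES = {
    "record": ".//table[@id='books']/tr",
    "fields": {"title": "td[1]", "price": "td[2]"},
}


def extract(page: str, rules: dict) -> ET.Element:
    """Apply the rules to the page and return the results as an XML tree."""
    root = ET.fromstring(page)
    results = ET.Element("results")
    for record in root.findall(rules["record"]):
        item = ET.SubElement(results, "item")
        for name, path in rules["fields"].items():
            hit = record.find(path)
            field = ET.SubElement(item, name)
            # A field whose XPath matches nothing is emitted as empty text.
            field.text = (hit.text or "").strip() if hit is not None else ""
    return results


xml_out = ET.tostring(extract(PAGE, RULES), encoding="unicode")
print(xml_out)
```

Keeping the rules separate from the extraction code is what makes such a wrapper customizable: adapting it to another site would, under these assumptions, mean changing only the XPath expressions in `RULES`.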