
An Information Extraction System For DynamicView

Posted on: 2007-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: J He
Full Text: PDF
GTID: 2178360212965573
Subject: Computer software and theory
Abstract/Summary:
With the growth of the World Wide Web (WWW), Web Information Retrieval (WIR) and Web Information Extraction (WIE) technologies have developed rapidly, and more and more researchers are paying attention to how to extract information from the Web. WIR is used to locate a specific page on the Web that contains the relevant information. Unlike WIR, WIE extracts the relevant information directly from a specific page and transforms it into a structured format.

Generally speaking, WIE methods fall into two categories: those based on the structure of a page, such as Page Structure Grammar Inference and Page Segmentation, and those based on the language features of a page, such as template filling. Unlike free-text information extraction, WIE has only a small number of annotated web pages available for any specific domain. Hence, how to extract information with high accuracy without increasing the tedious manual work is a critical problem to be solved.

Based on an analysis of existing WIR and WIE algorithms and the goals of the DynamicView project, this thesis proposes a WIR algorithm based on structure templates to find faculty homepages on the Web, and a page-segmentation-based WIE algorithm to extract the faculty members' research interests from their homepages. The WIR algorithm incorporates WIE techniques, which makes it easy to find web pages that share the same attributes. The page segmentation algorithm for WIE, DeSeA (Delimiter based Segmentation Algorithm), filters irrelevant information out of a web page. Research interests can then be extracted easily from the relevant segments using domain knowledge. Experiments show that these two algorithms are well suited to DynamicView.
Keywords/Search Tags: Web Information Retrieval, Web Information Extraction, Machine Learning, Semantic Web
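
The abstract does not give the details of DeSeA, so the following is only a minimal Python sketch of the general idea it describes: split a page into segments at delimiters, discard segments that look irrelevant, and pull research interests out of the remaining segment using simple domain knowledge. The delimiter tags, the cue phrases, and the function names `segment` and `extract_interests` are all assumptions made for illustration; they are not taken from the thesis.

```python
import re

# Assumed delimiter tags: the abstract does not say which delimiters DeSeA uses.
DELIMITER_RE = re.compile(
    r"<\s*/?\s*(?:hr|h[1-6]|p|div|table|tr|ul|ol)\b[^>]*>", re.IGNORECASE
)
# Any remaining markup inside a segment is dropped.
TAG_RE = re.compile(r"<[^>]+>")
# Assumed "domain knowledge": cue phrases that introduce research interests.
CUE_RE = re.compile(
    r"research\s+(?:interests?|areas?)\s*(?:include|:)?", re.IGNORECASE
)


def segment(html):
    """Split a page into plain-text segments at the delimiter tags."""
    segments = []
    for part in DELIMITER_RE.split(html):
        text = " ".join(TAG_RE.sub(" ", part).split())
        if text:
            segments.append(text)
    return segments


def extract_interests(html):
    """Return the items listed in the first segment that follows a cue phrase."""
    segments = segment(html)
    for index, text in enumerate(segments):
        match = CUE_RE.search(text)
        if match is None:
            continue  # irrelevant segment: filtered out
        tail = text[match.end():].strip()
        # If the cue phrase is a heading on its own, the list is the next segment.
        if not tail and index + 1 < len(segments):
            tail = segments[index + 1]
        items = [item.strip(" .") for item in re.split(r"[,;]", tail)]
        return [item for item in items if item]
    return []


# Made-up homepage fragment, used only to exercise the sketch.
page = (
    "<h1>Prof. Example</h1><p>Office: Room 101</p>"
    "<h2>Research Interests</h2>"
    "<p>Machine learning, semantic web; information extraction.</p>"
    "<hr><p>Phone: 555-0100</p>"
)
print(extract_interests(page))
```

Segmenting before extracting keeps the extraction rule simple: contact details and other unrelated blocks never reach the cue-phrase matcher's list-splitting step, which is the filtering role the abstract attributes to page segmentation.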