
Web Named Entity Extraction Based On Link Path Search

Posted on: 2014-04-12; Degree: Master; Type: Thesis
Country: China; Candidate: Y H Ma; Full Text: PDF
GTID: 2268330401488807; Subject: Computer application technology
Abstract/Summary:
With the rapid development of computers and the Internet, the amount of information is growing exponentially. As the Web has become a huge repository of information, acquiring the knowledge that is actually needed from large volumes of data is a challenging problem. Information extraction aims to identify valuable information in unstructured or semi-structured web text and transform it into structured data. As a key technique and subtask of information extraction, named entity extraction has drawn wide attention from scholars at home and abroad.

Traditional named entity extraction methods require a manually labeled training dataset. In general, they focus on news texts, which include fewer named entity categories and have higher complexity. In this thesis, two novel Web named entity extraction methods are proposed. Because they require no manual intervention, our methods are more effective and accurate, and they improve automation and portability. Our work is as follows:

(1) We identify common URL features of homepages by analyzing the URLs of personal homepages in the training dataset. A homepage classifier can then be built by combining common and specific features.

(2) We propose a method for extracting personal names based on link path search. This method integrates anchor texts and web page titles to extract a personal name. Because it avoids the loss of summary information caused by adjacent links, this method extracts personal names effectively. Experiments conducted on 25 datasets show that the average accuracy reaches 86.11%.

(3) We propose a method for extracting Email addresses based on “HttpClient” and regular expressions. Experimental results show that the precision is 92.41%, which basically meets the needs of applications.

(4) We implement a prototype system for link path based Web named entity extraction. This system lets researchers focus on designing algorithms, conducting experiments, and supporting real-world applications.
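The Email extraction step in item (3) can be illustrated with a minimal sketch. The abstract does not give the regular expression the thesis uses, so the pattern below is an assumption: a simplified Email shape, not a full address grammar. The function name `extract_emails` is hypothetical, and page fetching (done with “HttpClient” in the thesis) is left out; the sketch starts from page text that has already been retrieved.

```python
import re
from typing import List

# Assumed pattern (not taken from the thesis): local part, "@", then a
# domain made of two or more dot-separated labels.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(page_text: str) -> List[str]:
    """Return Email-like strings from page text, in order of first appearance.

    Duplicates are dropped case-insensitively, so an address that appears
    both as visible text and inside a mailto: link is reported once.
    """
    seen = set()
    results = []
    for match in EMAIL_PATTERN.finditer(page_text):
        email = match.group(0)
        key = email.lower()
        if key not in seen:
            seen.add(key)
            results.append(email)
    return results


if __name__ == "__main__":
    sample = (
        'Contact: alice@example.edu or '
        '<a href="mailto:Alice@Example.edu">mail</a>; '
        'bob.smith@cs.example.org.'
    )
    for address in extract_emails(sample):
        print(address)
```

A pattern this simple will miss obfuscated addresses (for example "name at domain dot edu") and may accept some strings that are not valid addresses, which is one reason a reported precision below 100% is plausible for any regex-based approach.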
Keywords/Search Tags: Data Mining, Named Entity, Information Extraction, Link Path