Information Extraction System In Semantic Based Scientific Literature Sharing Platform

Posted on: 2008-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Huang
Full Text: PDF
GTID: 2178360272969562
Subject: Computer software and theory
Abstract/Summary:
With the spread of the Internet and personal computers, the volume of scientific literature has been growing exponentially. To retrieve literature quickly and accurately, extracting the metadata of scientific literature has become increasingly important. However, current information extraction technologies have drawbacks: for example, they are hard to adapt, and their performance is low.

To address these problems, this paper proposes a template-based algorithm for extracting literature header information (title, authors, and abstract) and a statistics-based algorithm for extracting tailer information (the title, authors, source, and year of each reference). The proposed algorithms take full account of the diverse characteristics of header and tailer information in the information extraction system of SemreX, a semantic-based literature sharing platform. Moreover, precision, recall, F-measure, and accuracy are improved through extraction preprocessing, template definition and matching, style statistics, polynomial fitting, and related methods.

The main idea of the template-based header information algorithm is as follows. First, templates for header information and for scientific literature are defined; the literature templates are generated by combining different kinds of header information templates. When performing extraction, the system selects the most suitable template, which is generally the one with the largest weight for the finite state automaton. The header information is then extracted according to the selected template.

The main idea of the statistics-based tailer information algorithm is as follows. First, statistics are gathered on the patterns of tailer information and on special symbols. Second, the statistical data are fitted with a polynomial model to produce a probability formula. Tailer information is then predicted by comparing probabilities.
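The template-selection step of the header algorithm can be illustrated with a minimal Python sketch. It is not the SemreX implementation: the regular expressions stand in for the finite state automata mentioned above, and the template names, weights, and the `HeaderTemplate` and `extract_header` identifiers are all hypothetical, chosen only to show the idea of picking the highest-weight matching template.

```python
import re
from dataclasses import dataclass


@dataclass
class HeaderTemplate:
    name: str            # hypothetical label for the template
    weight: float        # hypothetical weight; higher means preferred
    pattern: re.Pattern  # regex standing in for a finite state automaton


# Two illustrative templates; real templates would be far more varied.
TEMPLATES = [
    HeaderTemplate(
        name="explicit-abstract-heading",
        weight=0.9,
        pattern=re.compile(
            r"\A(?P<title>[^\n]+)\n(?P<authors>[^\n]+)\n+"
            r"Abstract[:.]?\s*(?P<abstract>.+)\Z",
            re.DOTALL,
        ),
    ),
    HeaderTemplate(
        name="blank-line-before-abstract",
        weight=0.5,
        pattern=re.compile(
            r"\A(?P<title>[^\n]+)\n(?P<authors>[^\n]+)\n\n(?P<abstract>.+)\Z",
            re.DOTALL,
        ),
    ),
]


def extract_header(text):
    """Return title/authors/abstract from the best matching template, or None."""
    text = text.strip()
    # Keep every template whose pattern accepts the whole header text.
    candidates = []
    for template in TEMPLATES:
        match = template.pattern.match(text)
        if match:
            candidates.append((template, match))
    if not candidates:
        return None
    # Select the matching template with the largest weight.
    best, match = max(candidates, key=lambda pair: pair[0].weight)
    # Collapse internal whitespace in each extracted field.
    fields = {key: " ".join(value.split()) for key, value in match.groupdict().items()}
    fields["template"] = best.name
    return fields
```

Calling `extract_header` on a header whose third block starts with "Abstract:" matches both templates here, and the higher-weight one is used, so the "Abstract:" label is not included in the extracted abstract.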
Finally, the extracted information can be revised, filtered, and updated.

The information extraction system of SemreX is implemented in Java and Perl on Windows and has been tested. The functional tests confirm that the system works correctly and extracts the header and tailer information of scientific literature more accurately. For header information (title, authors, and abstract), the test results show an extraction precision of 91.9%, 86.2%, and 81.5% respectively, a recall of 89.1%, 84.4%, and 80.2% respectively, an F-measure of 90.4%, 88.5%, and 80.8% respectively, and an accuracy of 96.3%, 80.2%, and 88.4% respectively. For tailer information (title, authors, source, and year), the test results show an extraction precision of 89.9%, 91.2%, 81.9%, and 88.3% respectively, a recall of 80.3%, 87.3%, 78.9%, and 87.0% respectively, an F-measure of 86.5%, 89.1%, 80.5%, and 86.4% respectively, and an accuracy of 84.9%, 84.5%, 77.9%, and 87.6% respectively.
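For readers unfamiliar with the metrics reported above, the following sketch shows how precision, recall, and F-measure are commonly computed. It assumes the standard definitions and the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not state which variant the thesis uses, and the function names are hypothetical.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and balanced F-measure from raw counts."""
    extracted = true_pos + false_pos   # fields the system output
    expected = true_pos + false_neg    # fields present in the ground truth
    precision = true_pos / extracted if extracted else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall, f_measure(precision, recall)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed balanced F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `f_measure(91.9, 89.1)` gives roughly 90.5, close to the 90.4% reported for header titles. The thesis may define or round its metrics differently, so values from this sketch need not reproduce every reported figure.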
Keywords/Search Tags: Information Extraction, Template Matching, Finite State Automaton, Polynomial Fitting, Scientific Literature