Information Extraction System In Semantic Based Scientific Literature Sharing Platform

Posted on:2008-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:Z W Huang

Full Text:PDF

GTID:2178360272969562

Subject:Computer software and theory

Abstract/Summary:

With the spread of the Internet and personal computers, the volume of scientific literature has been growing exponentially. To retrieve literature quickly and accurately, extracting the metadata of scientific literature has become increasingly important. However, current information extraction technologies have drawbacks: for example, they are hard to adapt and their performance is low. To address these problems, this paper proposes a template-based algorithm for extracting literature header information (title, authors, and abstract) and a statistics-based algorithm for extracting tailer information (the title, authors, source, and year of each reference). The proposed algorithms take full account of the diverse characteristics of header and tailer information in the information extraction system of SemreX, a semantic-based literature sharing platform. Moreover, precision, recall, F-measure, and accuracy are improved through extraction preprocessing, template definition and matching, style statistics, polynomial fitting, and related methods.

The main idea of the template-based header information algorithm is as follows. First, templates for header information and for scientific literature are defined; the literature templates are generated by combining different kinds of header information templates. During extraction, the system selects the most suitable template, which is generally the one with the largest weight for the finite state automaton. The header information is then extracted according to the selected template.

The main idea of the statistics-based tailer information algorithm is as follows. First, statistics are gathered on the patterns of tailer information and special symbols. Second, the statistical data are fitted with a polynomial model to produce a probability formula. Then, tailer information is predicted by comparing probabilities.
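
The statistics-based steps described above (gather statistics, fit a polynomial, compare probabilities) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the feature (relative position of a token within a reference string), the sample frequencies, the polynomial degree, and the function names `fit_polynomial`, `evaluate`, and `predict_field` are all hypothetical. The sketch uses an ordinary least-squares fit in pure Python.

```python
from typing import Dict, List, Sequence


def fit_polynomial(xs: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*x + ..."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[float(sum(x ** (i + j) for x in xs)) for j in range(n)] for i in range(n)]
    b = [float(sum(y * x ** i for x, y in zip(xs, ys))) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the fitted polynomial at x (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def predict_field(models: Dict[str, List[float]], x: float) -> str:
    """Pick the field whose fitted score is largest at feature value x."""
    return max(models, key=lambda field: evaluate(models[field], x))


# Hypothetical statistics: the fraction of tokens at each relative position
# (0.0 = start of the reference, 1.0 = end) that belong to each field.
POSITIONS = [0.0, 0.25, 0.5, 0.75, 1.0]
STATS = {
    "author": [0.90, 0.40, 0.10, 0.05, 0.00],
    "title": [0.05, 0.50, 0.60, 0.20, 0.05],
    "source": [0.00, 0.05, 0.25, 0.60, 0.35],
    "year": [0.05, 0.05, 0.05, 0.15, 0.60],
}

# One quadratic per field; the degree is an arbitrary choice for this sketch.
models = {field: fit_polynomial(POSITIONS, freqs, 2) for field, freqs in STATS.items()}

for position in (0.0, 0.5, 1.0):
    print(position, predict_field(models, position))
```

The fitted values are scores rather than calibrated probabilities (a low-degree polynomial can dip below 0 or rise above 1), so only their relative order is used when choosing a field.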
As a final step, the extracted information can be revised, filtered, and updated.

The information extraction system of SemreX is implemented in Java and Perl on Windows and has been tested. Functional tests confirm that the system works correctly and extracts header and tailer information from scientific literature more accurately. The test results indicate that, for header information (title, authors, and abstract), extraction precision is 91.9%, 86.2%, and 81.5%, recall is 89.1%, 84.4%, and 80.2%, F-measure is 90.4%, 88.5%, and 80.8%, and accuracy is 96.3%, 80.2%, and 88.4%, respectively. For tailer information (title, authors, source, and year of references), extraction precision is 89.9%, 91.2%, 81.9%, and 88.3%, recall is 80.3%, 87.3%, 78.9%, and 87.0%, F-measure is 86.5%, 89.1%, 80.5%, and 86.4%, and accuracy is 84.9%, 84.5%, 77.9%, and 87.6%, respectively.
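
For readers unfamiliar with the four metrics reported above, the sketch below computes them from a confusion count using the standard textbook definitions. The abstract does not state how the thesis defines or averages these metrics, so treating them as the usual per-field ratios is an assumption; the `Counts` class and the sample numbers are hypothetical and are not results from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical per-field confusion counts for an extraction run."""
    tp: int  # field instances extracted correctly
    fp: int  # spans extracted as this field but wrong
    fn: int  # field instances that were missed
    tn: int  # spans correctly not labeled as this field


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 for an empty denominator instead of raising.
    return numerator / denominator if denominator else 0.0


def precision(c: Counts) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Counts) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def f_measure(c: Counts) -> float:
    # Balanced F-measure: harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


def accuracy(c: Counts) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


# Made-up counts, for illustration only.
example = Counts(tp=80, fp=20, fn=10, tn=90)
for name, metric in (("precision", precision), ("recall", recall),
                     ("F-measure", f_measure), ("accuracy", accuracy)):
    print(f"{name}: {metric(example):.1%}")
```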

Keywords/Search Tags:

Information Extraction, Template Matching, Finite State Automaton, Polynomial Fitting, Scientific Literature

Related items

1	Multiplier Based On Finite State Machine Design And Implementation
2	Research On Exploiting And Serving Mode Of Scientific Literature Information In Digital Environment
3	The Literature Information Retrieval And Matching From The Web
4	Research And Implementation Of Lip Reading System Based On Finite State Automaton
5	Research On Software Behavior Specification Mining Based On Extended Finite State Automaton
6	Research And Implementation Of Data Mining System For Scientific Literature
7	Research And Application Of Bilingual Terminology Extraction For Scientific Literature
8	Research And Analysis Of Dynamic Information Flow Monitoring Based On Finite State Automaton
9	Research And Implementation Of XML Document Publish/Subscribe System Based On Nondeterministic Finite Automaton
10	Bibliometric Analysis Of Output And Collaboration Of China's Scientific Literature