
Design and Implementation of a PDF Text Content Extraction System for Medical Knowledge

Posted on: 2019-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
GTID: 2428330566997292
Subject: Software engineering
Abstract/Summary:
With the development of medical informatization, large volumes of electronic medical data have accumulated. Faced with these massive information and data resources, people often run into the same difficulty: the amount of information is large, but the usable portion is small. Obtaining the hidden, useful knowledge has therefore become an urgent problem, and knowledge mining has emerged in response. The first step of knowledge mining is data acquisition, and being able to collect the information of interest easily is an important foundation for it. This thesis presents a PDF text content extraction system for medical knowledge mining. Against the background of a medical knowledge mining system, it reviews the current state of research on medical knowledge mining, the applications of PDF documents, and document format conversion technology, and then covers requirements analysis, system design, implementation, and system testing. Through these steps, a prototype of the PDF document content extraction system is completed. As a subsystem of the medical knowledge mining system, it provides PDF document parsing together with the design and implementation of two conversion schemes, one to TXT format and one to XML format. The TXT conversion module, based on the structural characteristics of PDF documents, proposes a new parsing approach that ignores secondary information in order to locate the key positions. On this basis, specific solutions are given for data streams encoded with several filters. The work involves the open source tool PDFBox and describes how to extract the text content stream and decode it, with reference to the source code. By summarizing a large number of PDF documents, the XML conversion module defines a new markup rule, establishes a mapping from that rule to an XML schema, and implements a conversion strategy from PDF to XML. Finally, practical testing shows that the system can extract text content automatically. This benefits the further development and use of PDF in medical information processing and is significant for current research on medical knowledge mining.
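The pipeline the abstract describes (decode a filtered data stream, pull the text out of the content stream, then map it to XML) can be illustrated with a minimal sketch. This is not the thesis's implementation, which builds on PDFBox: the sketch uses only the Python standard library, handles only the FlateDecode filter and the simple `(...) Tj` text-showing operator, and the XML element names `document` and `line` are hypothetical, since the abstract does not give the thesis's actual markup rule.

```python
import re
import zlib
import xml.etree.ElementTree as ET

# Matches a PDF literal string followed by the Tj (show text) operator,
# e.g. "(Hello) Tj". Simplification: escaped parentheses are handled, but
# nested unescaped parentheses, hex strings, and TJ arrays are not.
TJ_PATTERN = re.compile(rb"\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj")


def decode_flate_stream(raw: bytes) -> bytes:
    """Decode a stream whose filter is FlateDecode (zlib/deflate)."""
    return zlib.decompress(raw)


def extract_text_runs(content: bytes) -> list[str]:
    """Return the literal strings shown by Tj operators, in stream order."""
    runs = []
    for match in TJ_PATTERN.finditer(content):
        text = match.group("text")
        # Undo the escapes \( \) and \\ used inside PDF literal strings.
        text = re.sub(rb"\\([()\\])", rb"\1", text)
        # Assumption: single-byte text; real fonts may need a CMap lookup.
        runs.append(text.decode("latin-1"))
    return runs


def runs_to_xml(runs: list[str]) -> str:
    """Wrap each text run in a hypothetical <line> element."""
    root = ET.Element("document")
    for number, run in enumerate(runs, start=1):
        line = ET.SubElement(root, "line", {"n": str(number)})
        line.text = run
    return ET.tostring(root, encoding="unicode")


# Illustrative content stream (made-up sample data), compressed the way a
# FlateDecode stream would be stored in a PDF file.
sample_content = (
    b"BT /F1 12 Tf 72 720 Td (Aspirin 100 mg) Tj "
    b"0 -14 Td (Take once daily \\(oral\\)) Tj ET"
)
compressed = zlib.compress(sample_content)

runs = extract_text_runs(decode_flate_stream(compressed))
xml_output = runs_to_xml(runs)
print(xml_output)
```

A production extractor would also need the other stream filters, font encodings, and text positioning operators, which is the kind of work a library such as PDFBox takes on.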
Keywords: knowledge mining, PDF documents, file parsing, text extraction, XML files