Design And Implementation Of PDF Text Content Extraction System For Medical Knowledge

Posted on:2019-03-15

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Liu

GTID:2428330566997292

Subject:Software engineering

Abstract/Summary:

With the development of medical informatization, a large volume of electronic medical data has accumulated. Faced with these massive information and data resources, people often encounter a difficult problem: the amount of information is large, but little of it is readily usable. How to obtain the hidden, useful knowledge is therefore an urgent problem, and knowledge mining has emerged in response. The first step of knowledge mining is data acquisition, and being able to collect the information of interest easily is an important basis for it. This thesis presents a PDF text content extraction system for medical knowledge mining. Against the background of a medical knowledge mining system, it reviews the current state of research on medical knowledge mining, PDF document applications, and document format conversion technology, and then covers requirements analysis, system design, implementation, and system testing. Through these steps, a prototype of the PDF document content extraction system is completed. As a subsystem of the medical knowledge mining system, it implements PDF document parsing, together with the design and implementation of a TXT format conversion scheme and an XML format conversion scheme. Based on the structural characteristics of PDF documents, the TXT format conversion module puts forward a new parsing approach that ignores secondary information in order to locate the key positions. On this basis, specific solutions are given for data streams encoded with several filters. The work makes use of the open source tool PDFBox and describes how to extract the text content stream from the source code and decode it. By summarizing a large number of PDF documents, the XML conversion module defines a new markup rule, establishes a mapping from that markup rule to an XML schema, and implements a transformation strategy from the PDF format to the XML format. Finally, practical testing shows that the system can extract text content automatically. This benefits the further development and use of PDF in the medical information processing field and is of significance to current research on medical knowledge mining.
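The abstract's pipeline (locate the content stream, decode its filter, pull out the text, emit XML) can be illustrated with a small sketch. This is not the thesis's implementation and does not use PDFBox: it is a standard-library Python example that handles only the FlateDecode filter and the simple `Tj` text operator. The function names and the `document`/`line` element names are hypothetical, chosen for illustration; the thesis's own markup rule is not given in the abstract.

```python
import re
import zlib
import xml.etree.ElementTree as ET

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string operand followed by the Tj (show text) operator.
# Simplification: nested parentheses and escape decoding are not handled.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def decode_flate_streams(pdf_bytes):
    """Return the inflated bytes of every stream that zlib can decode.

    Streams using other filters (or no filter) raise zlib.error and are
    skipped; a fuller parser would read /Filter from the stream dictionary.
    """
    decoded = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            decoded.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue
    return decoded


def extract_text(content):
    """Collect the string operands of Tj operators in a content stream."""
    return [m.group(1).decode("latin-1") for m in TJ_RE.finditer(content)]


def to_xml(lines):
    """Wrap extracted lines in a hypothetical <document>/<line> structure."""
    root = ET.Element("document")
    for text in lines:
        ET.SubElement(root, "line").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Build a tiny PDF-like object in memory so the sketch runs without a file.
    content = b"BT /F1 12 Tf 72 720 Td (Hypertension) Tj 0 -14 Td (Stage 2) Tj ET"
    packed = zlib.compress(content)
    sample = (
        b"1 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
        + packed
        + b"\nendstream\nendobj\n"
    )
    for stream in decode_flate_streams(sample):
        print(to_xml(extract_text(stream)))
```

Real documents also need font encoding tables, the `TJ` array operator, and text positioning to reconstruct reading order, which is where a full library such as PDFBox is normally used.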

Keywords/Search Tags:

knowledge mining, PDF documents, file parsing, text extraction, XML files

Related items

1	Text mining with the exploitation of user's background knowledge: Discovering novel association rules from text
2	Research On Neuropeptides Extraction Based On Text Mining
3	Research On Product Patent Design Knowledge Extraction Technology Based On Text Mining
4	Knowledge discovery and hypothesis generation from biomedical literature using text mining
5	Research And Application Of Person Figure Mining Based On Text Analysis
6	Research And Realization Of Text Mining System For Project Files
7	Design And Implementation Of Web Document Extraction And Offline Collection System
8	The Application Of Text Mining Technology In The Analysis Of Academic Figures
9	Research On File Infection Method Based On File Parsing
10	Knowledge Extraction From Document-level Formatted Text