Information Extraction And Analysis Of Unstructured Documents

Posted on:2013-01-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y Huo

Full Text:PDF

GTID:2218330362960983

Subject:Software engineering

Abstract/Summary:

With the development of computer science and information technology, daily life is now filled with digital data. As office automation (OA) has spread in China, the computer has become an indispensable everyday tool, and the volume of data it generates is difficult even to estimate. Database storage technology provides a standardized, structured management model, but several questions remain open: whether all of the data generated in daily life can be standardized to fit a database format, whether important information can be obtained from this vast amount of data through mining and analysis, and whether useful patterns can be found in seemingly chaotic data. Traditional database storage and query methods clearly cannot meet these requirements. Collecting and analyzing unstructured data such as documents has therefore become an active research topic. Taking a practical point of view, and based on the actual needs of a government body, this thesis focuses on the common types of document data created by computer software and covers their extraction, collection, processing, storage in a database, mining, and analysis. To address the key technical difficulties, a program was developed for the actual application. It uses the Windows API to solve the problem of extraction compatibility: unstructured documents on the Windows operating system are converted into semi-structured data, which removes the need to call a different interface for each type of document.
Using Chinese word segmentation technology, the system segments the content of unstructured documents on the basis of the People's Daily corpus of January 1998. It then extracts entity information of interest, such as personal names, addresses, telephone numbers, license plates, ID numbers, bank card numbers, email addresses, and URLs, and stores this information in a database. This completes the extraction and storage of structured data, which can then be analyzed with models suited to the actual application, with the results of the analysis displayed graphically.
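
The extraction-and-storage step described above can be illustrated with a minimal Python sketch. It is not the system built in the thesis: the regular expressions, the `extract_entities` and `store_entities` helpers, and the `entity` table layout are all assumptions made for illustration. It covers only pattern-like entities (email addresses, URLs, 18-character ID numbers, 11-digit mobile numbers). Personal names and addresses would additionally need the word segmentation step, which is not shown here.

```python
import re
import sqlite3

# Illustrative patterns only (assumptions, not the rules used in the thesis).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s,;]+"),
    # 17 digits followed by a digit or X, not embedded in a longer digit run.
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    # 11 digits starting with 1 and a second digit in 3-9.
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def extract_entities(text):
    """Return a list of (entity_type, value) pairs found in the text."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group()))
    return found


def store_entities(conn, doc_name, entities):
    """Insert the extracted entities into a hypothetical 'entity' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entity (doc TEXT, type TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO entity VALUES (?, ?, ?)",
        [(doc_name, etype, value) for etype, value in entities],
    )
    conn.commit()
```

Once the entities are in a table, a query such as `SELECT type, COUNT(*) FROM entity GROUP BY type` gives per-type counts that could feed a chart, in the spirit of the graphical display the abstract mentions.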

Keywords/Search Tags:

extract, unstructured data, information extraction, message extraction

Related items

1	Research Unstructured And Basic Information Extraction Technology In Shipbuilding
2	Information Extraction Research And Application From Network Data
3	Research And Application Of Techniques For Collection And Retrieval On Unstructured Data
4	Research On Knowledge Graph Construction And Representation For Unstructured Data
5	Research On Web Information Extraction For Domain In Information Integration System
6	Research On Spatial And Temporal Information Extraction In Unstructured Text
7	Research On Information Extraction And Fusion Of Knowledge Graph For Unstructured Data
8	Automatic Information Extraction In Unstructured Deep Web
9	Planar Surface Extraction For Complex Facades From Unstructured TLS Point Clouds
10	Semi-supervised Blog Information Extraction Techniques Based On Document Structure