Design And Implementation Of Unstructured Document Extraction And Analysis System

Posted on: 2013-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Wang
Full Text: PDF
GTID: 2248330362460984
Subject: Software engineering
Abstract/Summary:
With the development of computer science and network technology, people depend ever more widely on networks and computers, and digital information affects us in many ways and has become indispensable in daily life. Over time we accumulate more and more digital information, usually in the form of documents and data. Database and storage technology now provide ways to manage this huge volume of data in a structured way, but not all data, documents in particular, can be easily and sensibly mapped into a structured database. So how can all of this data and these documents be transformed into a structured form? How can useful information be analysed and mined from the massive data? The system described here processes all kinds of common documents by extracting, collecting, storing into a database, mining and analysing them.

First, the system converts unstructured documents into semi-structured data by extracting each document's name, creation time and content. The Windows API is used to handle extraction compatibility, which solves the difficult problem of calling a different interface for each kind of document.

Second, to turn the semi-structured data into structured data, the system uses Chinese word segmentation, based on the People's Daily corpus of January 1998, to separate words. It extracts the entities of interest, such as personal names, addresses, phone numbers, license plates, ID numbers, email addresses, bank card numbers and URLs, and stores them in the database, which completes the extraction and storage of structured data.

Third, the system is designed as a distributed computing system to suit real application environments. Because the volume of unstructured documents awaiting processing is so large, the computing and analysis are distributed across several nodes, taking advantage of the local area network and the characteristics of the database.
The master node is responsible for storing and extracting the original documents, and the child nodes are responsible for data processing. All nodes run at full load, which resolves the problems of idle resources and processing bottlenecks. Experiments show that extracting and collecting 10 GB of documents takes 15 minutes with 6 nodes, which is an acceptable result.

Finally, the system provides visual graphs that present the analysis results according to actual requirements. The various graphs, laid out by an algorithm that simulates Hooke's-law spring dynamics, display the results of the data analysis and help users understand the outcome easily.

This project applies techniques such as data collection, extraction, distributed computing and graphic display. The system is divided into reusable function modules, in line with modern software development standards.
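
The entity extraction step described in the abstract can be illustrated with a small pattern-matching sketch. The thesis does not give its extraction rules, so everything here is an assumption: the function name `extract_entities`, the dictionary keys, and the regular expressions themselves (a simplified email pattern, an ASCII-only URL pattern, an 11-digit mainland China mobile number, and an 18-character resident ID number). Entities such as personal names and addresses depend on word segmentation and are not covered.

```python
import re
from typing import Dict, List

# Hypothetical, simplified patterns; the thesis does not specify its own rules.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    # Simplified email address: local part, "@", domain, dot, top-level domain.
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # ASCII-only URL so that adjacent Chinese text is not swallowed.
    "url": re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]+"),
    # 11-digit mobile number starting with 1, not embedded in a longer digit run.
    "phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # 18-character ID number: 17 digits plus a digit or X check character.
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Return every match of each pattern, keyed by entity type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    sample = (
        "Contact Zhang at zhang.san@example.com or 13812345678. "
        "See http://example.com/report for details. "
        "ID 11010519491231002X."
    )
    for entity_type, values in extract_entities(sample).items():
        print(entity_type, values)
```

In a pipeline like the one described, each match would then be written to a database row together with the source document's name and creation time. The digit lookarounds keep the phone pattern from matching inside a longer ID number.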
Keywords/Search Tags: unstructured data, distributed, structural transform, model analysis, graphic display