Information Extraction and Analysis of Unstructured Documents

Posted on: 2013-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Huo
Full Text: PDF
GTID: 2218330362960983
Subject: Software engineering
Abstract/Summary:
With the development of computer science and information technology, daily life now involves enormous volumes of digital data. As office automation (OA) has spread in China, the computer has become an indispensable everyday tool, and the amount of data it generates is difficult even to estimate. Database storage technology provides a standardized, structured model of management, but it remains an open question whether all of this data can be standardized to fit a database format, whether important information can be obtained from it through mining and analysis, and whether useful regularities can be found in seemingly chaotic data. Traditional database storage and query modes clearly cannot meet these requirements, so the collection and analysis of unstructured data such as documents has become an active research topic. Taking a practical point of view and drawing on the actual needs of a government body, this thesis addresses the common kinds of document data produced by computer software: extracting, collecting, and processing the data, storing it in a database, and mining and analyzing it. A program was developed for the real application, targeting the key technical difficulties. The Windows API is used to solve the compatibility problem of extraction: on the Windows operating system, unstructured documents are converted into semi-structured data, which removes the need to call a different interface for each kind of document.
Using Chinese word segmentation based on the People's Daily corpus of January 1998, the system segments the content of the unstructured documents and then extracts the entity information of interest, such as personal names, addresses, telephone numbers, license plate numbers, ID numbers, bank card numbers, email addresses, and URLs, and stores this information in a database. This completes the extraction and storage of structured data, which can then be analyzed with models suited to the actual application, with the results displayed graphically.
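
The extraction-and-storage step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how each entity type is recognized, so the regular expressions below are simplified assumptions that cover only four of the listed types (email address, URL, 18-character ID number, 11-digit mobile number). The names PATTERNS, extract_entities, store_entities, count_by_kind, and the table name entity are all hypothetical, and SQLite stands in for whatever database the system actually used.

```python
import re
import sqlite3

# Hypothetical, simplified patterns for illustration only; the thesis does not
# specify its recognition rules. Lookarounds stop a match inside a longer digit run.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s,;]+"),
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),  # 17 digits + digit or X
    "mobile": re.compile(r"(?<!\d)1\d{10}(?!\d)"),          # 11 digits starting with 1
}


def extract_entities(text):
    """Return a list of (entity_type, value) pairs found in the text."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group()))
    return found


def store_entities(conn, doc_name, entities):
    """Store extracted entities as rows in a table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entity (doc TEXT, kind TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO entity (doc, kind, value) VALUES (?, ?, ?)",
        [(doc_name, kind, value) for kind, value in entities],
    )
    conn.commit()


def count_by_kind(conn):
    """Count stored entities per type, as a simple example of later analysis."""
    rows = conn.execute(
        "SELECT kind, COUNT(*) FROM entity GROUP BY kind ORDER BY kind"
    )
    return dict(rows.fetchall())
```

Once the entities are rows in a table, the analysis stage reduces to ordinary queries such as count_by_kind. Names and addresses are left out of the sketch because they generally require the word segmentation step and cannot be matched reliably with fixed patterns.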
Keywords/Search Tags:extract, unstructured data, information extraction, message extraction