
Research And Application Of Extraction Method Of Semi-structured Text Information

Posted on: 2015-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Wang
Full Text: PDF
GTID: 2298330422488491
Subject: Computer application technology
Abstract/Summary:
With the rapid development of office automation, the data stored and represented by enterprises and government agencies has become distributed and heterogeneous. It includes not only traditional structured data, such as relational and object-oriented databases, but also semi-structured, self-describing data such as Excel, XML, and HTML, as well as unstructured data such as audio, images, video, and plain text documents. Companies and government departments hold large amounts of data with different structures and choose different storage modes according to the structure of each data type. Therefore, to search and share data of varied structures across companies and government departments, the integration of differently structured data has become an important research topic in network and database applications as well as in practice.

This paper mainly studies the integration of semi-structured and structured data. It takes Excel spreadsheets as a typical form of semi-structured data, and summarizes, analyzes, and classifies hundreds of Excel spreadsheets from different industries and in different forms according to their structures. Based on manual and programmatic extraction of this semi-structured data, the paper summarizes a number of typical extraction rules, formally describes these rules as instructions, and builds an instruction system for extracting data from semi-structured Excel spreadsheets. Finally, the paper proposes a general-purpose Excel spreadsheet data extraction model based on this instruction system.

The Excel spreadsheet data extraction model can not only quickly and accurately extract and load data from a specific Excel spreadsheet, but also automatically extract and load different types of Excel spreadsheet data by modifying configuration files. The model is extensible: a command interpreter interprets the rules database, which makes the model more general. The model has been used in several projects and packaged as a Web Service on the company’s servers so that different projects can use it conveniently, which demonstrates its versatility and practical value.
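The idea of an instruction-driven extractor can be illustrated with a minimal sketch. It is not the thesis's actual instruction system: the instruction names (`header_row`, `data_rows`, `skip_if`), the rule format, and the sample sheet are all hypothetical, and the spreadsheet is modeled as a plain list of rows rather than a real Excel file. The point is only that a small interpreter can apply a configurable rule list, so a differently shaped sheet needs a new rule list and no new code.

```python
from typing import Any

Grid = list[list[Any]]

# Hypothetical rule set; in the model described above, such rules would
# come from a configuration file or rules database.
RULES = [
    {"op": "header_row", "row": 1},                           # row index holding column names
    {"op": "data_rows", "start": 2, "stop_on_blank": True},   # read rows until a blank row
    {"op": "skip_if", "column": "Name", "equals": "Total"},   # drop summary rows
]


def extract(grid: Grid, rules: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Interpret each instruction in order and return the extracted records."""
    header: list[str] = []
    records: list[dict[str, Any]] = []
    for rule in rules:
        op = rule["op"]
        if op == "header_row":
            header = [str(cell).strip() for cell in grid[rule["row"]]]
        elif op == "data_rows":
            for row in grid[rule["start"]:]:
                is_blank = all(cell in (None, "") for cell in row)
                if rule.get("stop_on_blank") and is_blank:
                    break
                records.append(dict(zip(header, row)))
        elif op == "skip_if":
            records = [r for r in records if r.get(rule["column"]) != rule["equals"]]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return records


# Hypothetical sheet: a title row, a header row, data, a total row, and a footer.
SHEET: Grid = [
    ["Quarterly sales report", None, None],
    ["Name", "Region", "Amount"],
    ["Alice", "North", 120],
    ["Bob", "South", 80],
    ["Total", None, 200],
    [None, None, None],
    ["Prepared by finance", None, None],
]

records = extract(SHEET, RULES)
```

In this sketch, supporting a sheet whose header sits on a different row only requires changing the `row` and `start` values in the rule list, which mirrors the configuration-file approach described above.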
Keywords/Search Tags: Semi-structured data, Excel form, Data extraction, Import data, Instruction system