Geological Text Information Extraction Technology

Posted on: 2008-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: W Wu
GTID: 2120360212983507
Subject: Earth Exploration and Information Technology
Abstract/Summary:
With the development of information technology in geoscience, mineral resource evaluation has produced so much spatial data and so many documents that quickly finding a precise target document has become difficult. These documents contain physical and chemical testing data, geological survey reports, geographic maps, and GIS data. A specialized search engine is needed to search such kinds of spatial data under these demanding requirements. To meet this need, we designed a specialized search engine called GeoSou, which runs on a local area network (LAN) and indexes all documents shared on it. The system makes sharing geological data more convenient and improves workflow efficiency.

GeoSou is composed of the following modules: OS, system I/O, web server, indexer, LAN spider, and query module. All of these modules are driven by a state machine built into the system. GeoSou has the following distinctive features.

First, GeoSou implements a word segmentation system based on a geological dictionary. To limit the complexity of the trie, two hash tables manage the first word and the last word of each geological term, and the remaining words are stored in a trie whose degree is reduced as a result.

Second, to improve query speed, GeoSou defines a linear space in which every document is an element, and within this space it maps string comparison operations to logical operations. Before running the similarity algorithm, GeoSou can therefore decide whether the similarity needs to be calculated at all.

Third, GeoSou adopts the vector space model (VSM), which maps documents into another space where document comparison is carried out as vector computation. Each feature of a document is one dimension of a vector, so the relation between documents is mapped to the relation between vectors, and the relation between a document and a keyword becomes the relation between a point and a vector. This makes it easy to judge the similarity of documents.

Finally, GeoSou uses a Bloom filter to remove duplicate URLs and includes a lightweight HTTP web server that parses the interaction between the server and its clients.
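The dictionary-based segmentation described above could look roughly like the sketch below. It is one possible reading of the abstract, not GeoSou's actual code: it assumes that the "words" of a term are individual characters, that the first character is looked up in a hash table, that the middle characters live in a trie of nested dictionaries, and that the allowed last characters sit in a small hash set at each trie node. Forward maximum matching, the class name `GeoDictionary`, and the sample terms are all assumptions made for illustration.

```python
END = None  # hypothetical marker key: maps a trie node to its set of closing characters


class GeoDictionary:
    """Term dictionary: hash table for the first character, trie for the middle."""

    def __init__(self, terms):
        self.first = {}  # first character -> root of a trie over the middle characters
        for term in terms:
            if len(term) < 2:
                continue  # single-character terms are ignored in this sketch
            node = self.first.setdefault(term[0], {})
            for ch in term[1:-1]:
                node = node.setdefault(ch, {})
            # Hash set of characters that may end a term at this node.
            node.setdefault(END, set()).add(term[-1])

    def longest_match(self, text, start):
        """Return the length of the longest term beginning at text[start], or 0."""
        node = self.first.get(text[start]) if start < len(text) else None
        if node is None:
            return 0
        best = 0
        i = start + 1
        while i < len(text):
            ch = text[i]
            if ch in node.get(END, ()):
                best = i + 1 - start  # a complete term ends at this character
            node = node.get(ch)       # try to continue through the middle trie
            if node is None:
                break
            i += 1
        return best


def segment(text, dictionary):
    """Forward maximum matching; unmatched characters become one-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        n = dictionary.longest_match(text, i) or 1
        tokens.append(text[i:i + n])
        i += n
    return tokens


# Sample terms: granite, quartz, quartz sandstone.
geo_dict = GeoDictionary(["花岗岩", "石英", "石英砂岩"])
tokens = segment("花岗岩和石英砂岩", geo_dict)
```

Because the first character is resolved by a hash lookup, each trie root only branches over the characters that actually follow that first character, which is the kind of degree reduction the abstract mentions.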
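The second and third features can be illustrated together with a minimal sketch. It assumes term-frequency vectors and cosine similarity for the vector space model, and it interprets "mapping string comparison to logical operations" as a bitwise AND of per-document term bitmasks used as a cheap check before the similarity is computed. The abstract does not specify either detail, so the function names and the bitmask scheme are hypothetical.

```python
import math
from collections import Counter


def build_vocab(documents):
    """Assign each distinct term a bit position (assumed indexing scheme)."""
    vocab = {}
    for tokens in documents:
        for term in tokens:
            vocab.setdefault(term, len(vocab))
    return vocab


def signature(tokens, vocab):
    """Bitmask with one bit set for every term the document contains."""
    sig = 0
    for term in set(tokens):
        sig |= 1 << vocab[term]
    return sig


def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(tokens_a, tokens_b, vocab):
    # Logical pre-check: with no shared term the cosine is zero, so skip the math.
    if signature(tokens_a, vocab) & signature(tokens_b, vocab) == 0:
        return 0.0
    return cosine(Counter(tokens_a), Counter(tokens_b))


doc_a = ["granite", "quartz", "granite"]
doc_b = ["quartz", "sandstone"]
doc_c = ["limestone"]
vocab = build_vocab([doc_a, doc_b, doc_c])
score_ab = similarity(doc_a, doc_b, vocab)
score_ac = similarity(doc_a, doc_c, vocab)
```

Here `doc_a` and `doc_c` share no term, so the bitwise check returns immediately without building any vectors. A keyword query can be treated the same way, as a very short document.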
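URL deduplication with a Bloom filter, as mentioned in the last feature, can be sketched as follows. The filter size, the number of hash functions, the use of SHA-256 with double hashing, and the sample URLs are all assumptions chosen for illustration; the abstract gives no parameters.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash positions per item."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive all positions from one digest via double hashing: h1 + i * h2.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def dedupe_urls(urls, seen=None):
    """Return URLs not seen before, in their original order."""
    if seen is None:
        seen = BloomFilter()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# Hypothetical LAN share URLs found by a crawler.
crawled = [
    "http://host-a/share/report.doc",
    "http://host-b/maps/sheet1.shp",
    "http://host-a/share/report.doc",
]
unique = dedupe_urls(crawled)
```

A Bloom filter never misses a URL it has already stored, but it can occasionally report an unseen URL as a duplicate. For a crawler this means a small chance of skipping a page in exchange for a memory footprint that does not grow with the number of URLs.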
Keywords/Search Tags: GeoSou, Trie Tree, VSM, query regularity