Research On Geographic Information Retrieval

Posted on: 2010-04-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z S Li
GTID: 1118360275955559
Subject: Computer application technology
Abstract/Summary:
The World Wide Web has become an important medium for acquiring information, and information retrieval (IR) techniques help people do so effectively and efficiently. Both academia and industry have devoted increasing effort to IR research because of its broad application potential. At the same time, with the development of geographic information systems and the spread of location-based services, geographic information has attracted great interest. How to retrieve geographically related information is therefore an important research problem. Geographic Information Retrieval (GIR) concerns the retrieval of information that involves some kind of spatial awareness. Because many documents contain some kind of spatial reference, these geographical references (geo-references) are important for IR. Much work has been done on GIR, mainly on geographic entity extraction, geographic indexing, geographic ranking, geographic visualization, and geographic data mining. In my PhD research, I have studied geographic information retrieval in the following aspects:

1. LDA-based document model for geographic information retrieval. The Latent Dirichlet Allocation (LDA) model, a formal generative model, has recently been used to improve ad-hoc information retrieval. However, its feasibility and effectiveness for geographic information retrieval have not been explored. We propose an LDA-based document model that improves geographic information retrieval by combining the LDA model with conventional text retrieval models. The proposed model has been evaluated on the GeoCLEF 2007 collection. The results show that the LDA model performs stably on the GeoCLEF monolingual English task but needs further exploration.

2. Language-model-based geographic retrieval model. Unlike conventional information retrieval, GIR has an additional query spatial scope that restricts the user's area of interest. Existing work usually uses the query spatial scope as a document filter in the ranking process. However, word specificity is not uniformly distributed over geographic space, so the importance of a word should vary with the query spatial scope. We therefore propose a new geographic information retrieval model, LALM (Location-Aware Language Model), which incorporates the query spatial scope into the conventional language model. In LALM, we introduce a local model to capture the specificity of words whose locations are covered by a query spatial scope. Experimental results demonstrate the effectiveness of LALM.

3. Implicit-location-based geographic indexing structure. Current geographic search engines do not consider implicit locations. For example, for the query "snowstorms in North America", traditional methods simply return all web pages that include "North America". In fact, a web page that includes "Canada", "United States of America", or "Mexico" is also relevant to the query. "North America" can be seen as the implicit location for "Canada". We define implicit locations as the ancestors of the explicit locations mentioned in the documents. We propose an implicit-location-based geographical indexing structure and compare its performance with that of other indexing methods. Experimental results show that our approach outperforms previous ones.

4. IR-tree: an efficient indexing structure for geographic document search. A geographic search engine retrieves documents that are textually and spatially relevant to queries specified with keywords and locations, and ranks the retrieved documents according to their joint textual and spatial relevance. Conventional geographic search engines use inverted files and an R-tree to store the documents and locations separately, and filter out irrelevant documents by query spatial scope during query processing. This method is inefficient because the search and ranking processes are separate and sequential. Moreover, users are usually interested in the top-k documents, yet current geographic search engines cannot guarantee efficient retrieval of top-k results. Motivated by these shortcomings, we present an efficient index structure, the IR-tree, for geographic document search. The IR-tree addresses four design principles: spatial filtering, text filtering, storage overhead, and rank computation. It is a hybrid index that combines a spatial index with inverted files. Different alternatives of TF-IDF summary are stored in the internal nodes of the R-tree. During the search process, the relevant nodes are ranked by their TF-IDF scores, and only the most relevant node is processed at each step. The IR-tree thus performs spatial pruning and textual filtering seamlessly and supports top-k document search and ranking in an integrated fashion, based on the Rank-based document search algorithm. Through experiments over a wide range of settings, we show the superiority of the IR-tree over state-of-the-art approaches in terms of search efficiency.

5. Yellow pages query categorization. Yellow pages search engines provide a means of finding businesses close to particular locations. They are popular services and a rapidly evolving research area. The underlying data maintained in yellow pages search engines are typically labeled with Standard Industry Classification (SIC) categories, and users search the yellow pages by the categories that interest them. Categorizing yellow pages queries into a subset of topical categories can help improve the search experience and optimize result quality. However, yellow pages queries are short and ambiguous. In addition, a yellow pages query taxonomy is usually organized as a hierarchy with a huge number of categories. These characteristics make yellow pages query categorization difficult. In this work, we propose an adaptive yellow pages query categorization technique. The proposed technique is based on a TF-IDF similarity category matching scheme that provides more accurate query categorization than previous keyword-based matching schemes. To further improve categorization performance, we design several filtering schemes. Through extensive experimentation, we demonstrate that our proposed technique is the most suitable for yellow pages query categorization; it largely outperforms previous approaches, including keyword-based matching schemes and a hierarchical support vector machine classifier. The technique is very robust and avoids manual labeling work. Moreover, it is suitable for taxonomies of varied scale.
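
The implicit-location idea from item 3 can be illustrated with a minimal sketch. This is not the dissertation's actual index structure; it only shows the definition given above, in which a document is indexed under each explicit location it mentions and under all ancestors of that location. The toy gazetteer `PARENT`, the document IDs, and the function names `ancestors` and `build_index` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy gazetteer: each place maps to its parent (None for a root).
PARENT = {
    "North America": None,
    "Canada": "North America",
    "United States of America": "North America",
    "Mexico": "North America",
    "Ontario": "Canada",
}


def ancestors(place):
    """Yield the implicit locations (all ancestors) of an explicit place."""
    parent = PARENT.get(place)
    while parent is not None:
        yield parent
        parent = PARENT.get(parent)


def build_index(doc_locations):
    """Map each location to the documents that mention it or imply it."""
    index = defaultdict(set)
    for doc_id, places in doc_locations.items():
        for place in places:
            index[place].add(doc_id)          # explicit location
            for implicit in ancestors(place):
                index[implicit].add(doc_id)   # implicit (ancestor) location
    return index


# Assumed input: explicit locations already extracted from each document.
docs = {
    "d1": ["Canada"],
    "d2": ["Mexico"],
    "d3": ["North America"],
    "d4": ["Ontario"],
}
index = build_index(docs)
print(sorted(index["North America"]))
print(sorted(index["Canada"]))
```

With this index, a query scoped to "North America" also reaches documents that mention only "Canada", "Mexico", or "Ontario", which a plain match on the string "North America" would miss. The price is a larger index, since every document is stored once per ancestor.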
Keywords/Search Tags: Geographic information retrieval, geographic information extraction, geographic information retrieval model, geographic indexing structure, yellow pages query categorization, data mining