Research On Traffic Terminology Similarity Matchment Based On Topic Vertical Search Engine

Posted on: 2014-09-12  Degree: Master  Type: Thesis
Country: China  Candidate: F Wang  Full Text: PDF
GTID: 2268330422961862  Subject: Traffic Information Engineering & Control
Abstract/Summary:
Computing the similarity between terms in a specialized field is a prerequisite and foundation for data mining and natural language processing in that field. Web-PMI is a term similarity algorithm based on the number of hits returned by a search engine: using the hit counts retrieved for the terms, it calculates their similarity quantitatively. In a specialized field, however, a large general-purpose search engine returns only a limited number of hits, which often degrades the similarity calculation. The purpose of this paper is to build a vertical search engine system that improves the accuracy of term hit counts and, in turn, the accuracy of term similarity calculation.

First, a vertical search engine for the transportation domain is studied and implemented in this paper. A crawler for traffic-themed web pages was developed in-house within the framework of the open-source crawler project Heritrix.
The program successfully completed the work of crawling transportation-topic pages.

Second, the fixed format of the crawled web pages is analyzed, the redundant information in the pages is filtered out, and the index library of the retrieval system is built. The index library was developed in-house on top of open-source Lucene. It creates an ordered index of the parsed traffic-themed web pages, supports full-text retrieval of transportation terms, and returns the exact hit count for a term searched in the index library.

Finally, based on this traffic-topic vertical search engine system, the Web-PMI algorithm is used to test traffic terminology similarity calculation. The retrieval model for transportation terms in the algorithm is reconstructed by adding a new search operator, which reduces ambiguity in the search results, improves the domain relevance of the retrieved results, and improves the performance of the algorithm. The experimental results show that the new retrieval model improves term retrieval accuracy and eliminates the adverse effect that accidental co-occurrence of terms has on the similarity calculation.

The method proposed in this paper was carried out under the "transportation information consistency detection research" project. Application results show that the traffic vertical search engine system of this paper achieves very good similarity computation for rarely used terms in the transportation field, and that its calculation accuracy is also slightly higher than that of the large commercial search engine AltaVista.
The purpose of this paper is to put forward a solution for computing traffic terminology similarity. The method is also applicable to term similarity calculation in other fields, and it can effectively support terminology standardization, the identification of synonyms and near-synonyms, semantic search, similarity detection for terminology standards, and other related work.
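
To make the hit-count idea concrete, the sketch below shows one common formulation of Web-PMI from the literature. The abstract does not give the exact formula or the new search operator used in the thesis, so everything here is an assumption: the function names `count_hits` and `web_pmi`, the `total_docs` parameter (the size of the index), the co-occurrence threshold of 5, and the naive substring matching that stands in for a real index lookup.

```python
import math


def count_hits(documents, *terms):
    """Count documents containing every term (hypothetical stand-in for an index query)."""
    lowered = [t.lower() for t in terms]
    return sum(1 for doc in documents if all(t in doc.lower() for t in lowered))


def web_pmi(hits_p, hits_q, hits_pq, total_docs, min_cooccurrence=5):
    """Pointwise mutual information of two terms, estimated from hit counts.

    hits_p, hits_q: hits for each term alone; hits_pq: hits for both together.
    total_docs: number of indexed documents (an assumed normalizer).
    min_cooccurrence: co-occurrence counts at or below this value are treated
    as accidental and yield 0 (an illustrative threshold, not from the thesis).
    """
    if hits_p == 0 or hits_q == 0 or hits_pq <= min_cooccurrence:
        return 0.0
    p_joint = hits_pq / total_docs
    p_p = hits_p / total_docs
    p_q = hits_q / total_docs
    return math.log2(p_joint / (p_p * p_q))


if __name__ == "__main__":
    # Made-up hit counts, purely for illustration.
    print(web_pmi(hits_p=100, hits_q=200, hits_pq=50, total_docs=10000))
```

In a vertical search engine the hit counts would come from queries against the domain-specific index rather than from a general-purpose engine, which is the substitution the abstract describes. The threshold guards against the accidental co-occurrence the abstract mentions.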
Keywords/Search Tags: Terminology similarity, Semantic similarity computation, Vertical search engine, Heritrix, Lucene, Web-PMI algorithm