
Research And Application Of Word Segmentation Technology In Heterogeneous Data’s Unified Retrieval

Posted on: 2013-08-19 | Degree: Master | Type: Thesis
Country: China | Candidate: X M Han | Full Text: PDF
GTID: 2248330362470885 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of informatization, data of many kinds accumulates quickly and data structures grow increasingly complex. Faced with so much information, which differs greatly in logical structure and storage structure, people in the information age urgently need a way to find useful information easily, quickly, and accurately. To address unified retrieval over heterogeneous data, this thesis presents a unified retrieval system for heterogeneous data and introduces word segmentation technology to improve the precision and efficiency of the retrieval system.

This thesis reviews the state of research, in China and abroad, on word segmentation and heterogeneous data retrieval. It analyzes and summarizes the basic theory, common technologies and solutions, and typical algorithms in both areas. On this basis, it proposes a general framework for heterogeneous data retrieval and describes in detail the framework's division into levels, the functional modules at each level, the system's operating process, and the characteristics of the structure. After analyzing traditional word segmentation algorithms and dictionary mechanisms, it designs a fast word segmentation method that takes the characteristics of heterogeneous data retrieval into account and is based on a modified whole-word dichotomy dictionary. A concrete implementation of the algorithm is also given. The experimental results show that the algorithm segments text into words precisely and responds quickly. It performs well in segmenting queries for heterogeneous retrieval, extracting keywords, and comparing the similarity of search results.
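The abstract does not describe how the whole-word dichotomy dictionary is modified, so the following Python sketch only illustrates the general idea under stated assumptions: words are grouped by their first character, each group is kept sorted, and a whole-word lookup is a binary search within the group. Pairing the dictionary with forward maximum matching is an assumption, as are the function names, the `max_len` limit, and the toy word list; none of these come from the thesis.

```python
from bisect import bisect_left
from collections import defaultdict


def build_dictionary(words):
    """Group words by first character and sort each group for binary search."""
    index = defaultdict(list)
    for word in set(words):
        index[word[0]].append(word)
    return {first: sorted(group) for first, group in index.items()}


def contains(dictionary, word):
    """Whole-word lookup: select the first-character group, then binary search."""
    group = dictionary.get(word[0], [])
    pos = bisect_left(group, word)
    return pos < len(group) and group[pos] == word


def segment(text, dictionary, max_len=4):
    """Forward maximum matching (assumed strategy, not confirmed by the thesis)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or contains(dictionary, candidate):
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, chosen only for illustration.
WORDS = ["船舶", "信息", "管理", "系统", "信息管理"]
DICTIONARY = build_dictionary(WORDS)

print(segment("船舶信息管理系统", DICTIONARY))
```

Because each lookup touches only one first-character group, the cost of a lookup grows with the logarithm of that group's size rather than with the size of the whole dictionary, which is one plausible reason such a structure suits fast query segmentation.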
The thesis studies similarity calculation, which forms the core of the retrieval results processing layer. A similarity calculation algorithm based on a Bayesian model is devised. To improve retrieval efficiency, the improved fast word segmentation method is applied in the preprocessing stage of similarity calculation.

Finally, the word segmentation technology for heterogeneous data unified retrieval is applied to the ship information management system of a provincial affair bureau. The application results show clear improvements in data retrieval coverage, retrieval system response time, and retrieval precision. The approach can effectively solve the problem of unified retrieval over heterogeneous data.
Keywords/Search Tags: word segmentation, heterogeneous data retrieval, meta search engine, XML document, Bayes classifier