Research On Vertical Search Engine Of Recency-sensitive Objects

Posted on:2012-12-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Wu

Full Text:PDF

GTID:1119330332975933

Subject:Computer software

Abstract/Summary:

As search engine services grow more popular, domain-specific search requests are becoming more clearly defined, and the demand for personalized and recency-sensitive search is rising. As a result, efficient information retrieval based on vertical search engines has become a central issue in the search engine field. By using focused crawling, intelligent scheduling, and high-dimensional indexing techniques, together with domain knowledge and personalization, vertical search engines provide up-to-date, more personalized, and more professional search results.

However, most vertical search engines suffer from the following major problems: (1) the passive crawling mode of the crawler system results in a long delay between user query and result retrieval; (2) the scheduler of the crawler system schedules web page crawling aimlessly, which leads to very low utilization of crawling resources; (3) the indexing system does not perform well enough for online updates, and the merging results for certain unstructured text objects are poor. This dissertation studies these problems and the related key technologies in depth. Its major contributions are as follows.

Firstly, it proposes a semantic-based query-triggered crawling (QTC) technique to solve the long delay between user query and result retrieval caused by passive crawlers. Based on domain knowledge, QTC translates a user query into the request parameters of potential target results on domain web sites, and performs active crawling focused on current user queries. Extensive experiments and a beta test in real commercial applications show that QTC bridges the delay gap between user query and result retrieval and brings 10-second-level freshness to vertical search results.

Secondly, it proposes an object-level, change-aware resource scheduling technique to address the low utilization of crawling resources caused by blind crawling. This technique, named Poisson-Rank, uses a Poisson process to model the sequence of times at which a web object changes. The Poisson process model provides a quantitative estimate of object-level freshness. By scheduling crawler resources according to estimated object freshness, the technique not only improves resource utilization but also captures the change patterns of objects more accurately. Extensive experiments on real data show the accuracy of the Poisson process model's object freshness estimates, and improved resource utilization at nearly zero extra performance cost.

Thirdly, it proposes a more efficient high-dimensional indexing technique to address the performance problem of traditional high-dimensional indexing methods. This technique, named CB-LSH, combines a Compressed Bitmap (CB) index with a Locality-Sensitive Hashing (LSH) index. CB-LSH booleanizes each operator in the LSH index and brings CB into LSH. CB-LSH greatly improves performance and solves the online update problem for high-dimensional indexing. Theoretical analysis proves the improvements. Extensive experiments show that CB-LSH uses 1/3 less memory and achieves 10 times the index deletion performance, 4 times the query performance, and 1.5 times the insert performance. Applications in real commercial projects show that CB-LSH is feasible for online updates in a large image retrieval system.

Fourthly, it proposes a text clustering technique inspired by trigger pairs in natural language to improve on the clustering results of traditional text clustering algorithms for unstructured text data. Unstructured text data in e-commerce is very short, noisy, and full of professional vocabulary, which makes traditional text clustering algorithms ineffective. The trigger-pair based clustering technique (TrigSigs) uncovers hidden relations between words, adapts to professional vocabulary, and extracts keyword features to enable fine-grained, object-level clustering. Simulation experiments show that this technique filters out most noise, distributes weight effectively among word features, and greatly improves the clustering results.
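
The abstract does not give the actual Poisson-Rank formulas, so the following is only a minimal sketch of the general idea of scheduling crawls by a Poisson freshness estimate. It assumes a homogeneous Poisson process, under which the probability that an object has changed at least once in the t seconds since its last crawl is 1 - exp(-λt), and it assumes a naive rate estimate λ = (observed changes) / (observation time). The names `WebObject`, `change_rate`, `stale_probability`, and `schedule`, as well as the sample data, are hypothetical and not taken from the dissertation.

```python
import math
from dataclasses import dataclass


@dataclass
class WebObject:
    name: str
    changes_observed: int        # changes detected during the observation window
    observed_seconds: float      # length of the observation window
    seconds_since_crawl: float   # time elapsed since the last crawl


def change_rate(obj: WebObject) -> float:
    """Naive Poisson rate estimate (assumption): changes per second."""
    return obj.changes_observed / obj.observed_seconds


def stale_probability(obj: WebObject) -> float:
    """P(at least one change since the last crawl) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-change_rate(obj) * obj.seconds_since_crawl)


def schedule(objects: list[WebObject], budget: int) -> list[WebObject]:
    """Spend the crawl budget on the objects most likely to be stale."""
    return sorted(objects, key=stale_probability, reverse=True)[:budget]


# Hypothetical sample data for illustration only.
objects = [
    WebObject("flight-price", 120, 3600.0, 30.0),
    WebObject("hotel-page", 2, 3600.0, 30.0),
    WebObject("static-faq", 0, 3600.0, 600.0),
]
chosen = [o.name for o in schedule(objects, budget=2)]
```

In this sketch the frequently changing object is ranked first even though all objects were crawled recently, while the object with no observed changes is never prioritized, which is the kind of resource saving a change-aware scheduler aims for.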

Keywords/Search Tags:

vertical search, recency sensitive, query triggered, focused crawl, schedule algorithm, high-dimensional index, unstructured text

Related items

1	Study On Automobile Sale Forecast Method Based On Network Search Data
2	Research On Customer’s Satisfaction Index Based On Unstructured Text From The Perspective Of Big Data
3	Research On The Application Of Web Search Data On Predicting Real Estate Price Index
4	Effect Of Text Reviews And Reviews Volume On Product Sales Under The Improved Text Mining Method
5	Intelligent query for real estate search
6	Theory And Application Of Structured High-dimensional Multiple-index Models
7	High-dimensional Multi-objective Evolutionary Algorithm Based On Angle Selection And Dynamic Penalty
8	Recency And Confirmatory Effect In Audit Judgment
9	Prediction Of Kunming Commodity Price Index Based On Keyword Search
10	Research On The Production Scheduling Problems Based On Intelligent Optimization Algorithms