Font Size: a A A

LSH-based Similarity Search Of Web Data

Posted on:2012-08-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:P S YuanFull Text:PDF
GTID:1488303356971339Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the technique development and widespread applications of the Web, massive data with numerous formats is swelling on the Web. The management and the search for these data are one of the most important steps in Web applications. The similarity search is an important way for obtaining information from the web, furthermore, it is one of the most important applications of the Web. The similarity search of the Web is diversified which results from various data formats and applications of the Web. Due to the high dimensionality of the Web data, the similarity search for the Web data is not efficient. For trading the query performance and effectiveness, the approximate way of the similarity search is proposed. The locality sensitive hashing technique (LSH for short) is the state-of-the-art of the approximate similarity search, which received widely attention and applied in various fields for approximate similarity search.In this thesis, data on the web are classified firstly. Then the locality sensitive hash-ing technique is surveyed. In the light of the characteristics and challenges of the web data management, we make use of the locality sensitive hashing technique for the simi-larity search of the web data in the following three aspects:for the massive of the semi-structured XML data, the structural similarity search on the cluster platform is studied and implemented; for the string data, based on the hashing technique, the similarity join is researched and the performance is improved; for the text data similarity search, the learning-based framework for the similarity search and algorithm are investigated. The contributions of this thesis are summarized as follows:1. As the data intensive computing springing up, for large volume of the XML data, the cluster-based structural similarity search of the XML data is studied on cluster platform Hadoop, we implement the LSH index using MapReduce for the structural similarity search of XML data. The structural information of the XML trees are extracted using the pq-gram technique and indexed in the parallel way with locality sensitive hashing index. Experiments on the real datasets demonstrate that high performance and good scalability are obtained with our method.2. For the string similarity join problem on the string data, we analyze the existing work and propose the Hash"'-Join under string edit distance constraint with locality sensitive hashing technique. Firstly, the strings are divided into groups, and the string joins are conducted in each group string. Then the Trie-Join technique is applied for the similarity join. In addition, our method can be easily extended to the cluster platform for large scale string datasets. We also conduct the experiments on the real datasets, and the experimental results demonstrate that our method is effective and more scalable than existing work.3. In term of the similarity search of the high-dimensional data on the web, we study random projection technique of the locality sensitive hashing, and prove that the binary code after random projection satisfying the entropy maximizing criterion needed by the semantic hashing. And then the binary code is used as the class labels of the data objects. Based on the observations, we propose the learning-based framework and algorithms for the similarity search of the high-dimensional data. Under this framework, the approximate k-nearest neighbor search and the c-approximate nearest neighbor search are studied. Our method not only reduces the preprocessing time and space, but also improves the search efficiency.In sum, according to the characteristics of the similarity of the web data, the lo-cality sensitive hashing technique is used for the similarity search problems of the Web data. Our main work focus on:the structural similarity of large scale of XML data using MapReduce is implemented with locality sensitive hashing technique; On the basis of the Min-Hashing locality sensitive hashing technique, the string similarity join under the edit distance constraint is proposed; Based on the random projection learning framework, the similarity search algorithm is proposed for the high-dimensional data. Moreover, we conduct experiments with real datasets to verify the effectiveness and efficiency of the algorithms.
Keywords/Search Tags:Data Intensive Computing, Locality Sensitive Hashing, MapReduce, eXten-sible Markup Language, String Similarity Join, Approximate k-Nearest Neighbor Query, c-Approximate Nearest Neighbor Query
PDF Full Text Request
Related items