
Research and Application of Unstructured Data Retrieval Technology Based on HDFS

Posted on: 2017-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Chang
GTID: 2308330482480646
Subject: Computer technology
Abstract/Summary:
As information technology deepens and computing becomes widespread among the general public, the volume of data created every day is enormous, and this data takes many forms. Unstructured data accounts for the largest share, and much of the information people need in their daily work is contained in it. Full-text retrieval is a powerful tool for handling unstructured data and gives users a convenient way to find the information they need in data resources.

This thesis studies two aspects of full-text search: index construction and update strategy, and the ranking of retrieved results. The inverted index is an efficient way to index unstructured data and is the core of full-text search; the speed of index updates affects retrieval efficiency. On this basis, the thesis also studies ranking algorithms for retrieval results so that the results finally returned match users' expectations and needs. It examines typical index update strategies and the PageRank ranking algorithm in depth, analyzes their shortcomings in application, improves the PageRank ranking algorithm, and proposes a merge-based index update strategy built on DHT.

The main work of this thesis is as follows:

(1) In view of the characteristics of unstructured data, the thesis studies the read and write mechanisms of HDFS, examines how the MapReduce model builds an inverted index structure, and analyzes task scheduling and the execution process of the MapReduce model.

(2) A merge-based inverted index update algorithm based on DHT is proposed. The algorithm supports real-time updates for dynamic documents while reducing the cost of index merging through multi-way merging. A parameter dynamically adjusts the balance between update efficiency and retrieval performance. Experiments on a retrieval platform built on an HDFS cluster show that the algorithm has certain advantages.

(3) An improved ranking algorithm, VSMT-PageRank, is proposed. It introduces a time factor and a similarity calculation on top of PageRank. This addresses PageRank's bias toward old pages and the resulting lack of timeliness in search results, and it also mitigates the topic drift of the traditional ranking algorithm, improving user satisfaction with the results. Comparative experiments on a retrieval platform built on an HDFS cluster, using data sets crawled from Sina, verify the effectiveness of the proposed algorithm.
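
To make the idea of a merge-based inverted index update in item (2) more concrete, the following is a minimal single-machine sketch in Python. It is not the thesis's DHT-based algorithm, whose details the abstract does not give. It only illustrates the general technique: new documents are buffered, flushed into small index segments, and the segments are combined with a multi-way merge once their number exceeds a threshold. All names here (`tokenize`, `build_segment`, `merge_segments`, `MergingIndex`, `buffer_size`, `max_segments`) are hypothetical and chosen for illustration.

```python
import heapq
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_segment(docs):
    """Build one inverted-index segment from {doc_id: text}.

    Mirrors the map/reduce idea: emit (term, doc_id) pairs,
    then group the doc ids by term into sorted posting lists.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def merge_segments(segments):
    """Multi-way merge of several segments into a single segment."""
    merged = {}
    for term in set().union(*segments):
        lists = [seg[term] for seg in segments if term in seg]
        out = []
        # heapq.merge walks all sorted posting lists in one pass.
        for doc_id in heapq.merge(*lists):
            if not out or out[-1] != doc_id:  # drop duplicates
                out.append(doc_id)
        merged[term] = out
    return merged


class MergingIndex:
    """Hypothetical index: buffered writes plus threshold-triggered merges.

    max_segments is an assumed tuning parameter. A small value merges
    often (slower updates, faster search); a large value merges rarely
    (faster updates, more segments to scan per query).
    """

    def __init__(self, buffer_size=2, max_segments=3):
        self.buffer_size = buffer_size
        self.max_segments = max_segments
        self.buffer = {}
        self.segments = []

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.segments.append(build_segment(self.buffer))
            self.buffer = {}
        if len(self.segments) > self.max_segments:
            self.segments = [merge_segments(self.segments)]

    def search(self, term):
        term = term.lower()
        hits = set()
        for seg in self.segments:
            hits.update(seg.get(term, []))
        # Unflushed documents stay searchable, so updates are visible at once.
        for doc_id, text in self.buffer.items():
            if term in tokenize(text):
                hits.add(doc_id)
        return sorted(hits)


if __name__ == "__main__":
    idx = MergingIndex(buffer_size=1, max_segments=2)
    idx.add(1, "HDFS stores unstructured data")
    idx.add(2, "inverted index on HDFS")
    idx.add(3, "index update strategy")
    print(idx.search("hdfs"), len(idx.segments))
```

In this sketch the `max_segments` threshold plays the role of the balancing parameter mentioned in item (2): it trades update cost against query cost. How the thesis actually defines and adjusts its parameter, and how DHT is involved, is not stated in the abstract.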
Keywords/Search Tags: unstructured data, HDFS system, inverted index update, results sorting