The Research And Implementation Of Technologies For Analyzing Internet Text Based On Hadoop

Posted on:2015-02-21

Degree:Master

Type:Thesis

Country:China

Candidate:T Zhou

Full Text:PDF

GTID:2268330428967779

Subject:Computer application technology

Abstract/Summary:

With the rapid development of mobile devices and the Internet, the volume of information online is growing exponentially. The two key problems in dealing with Big Data are storing and computing over huge amounts of data, and traditional text processing systems cannot meet the needs of large-scale network text analysis in either respect. How to acquire network information efficiently and in real time, and how to store and process massive data, are common concerns in academia and industry at present, so the study of this problem is of great significance.

To address these storage and computing problems, cloud computing and big data processing technologies have moved from concept to application, providing a new direction for network text analysis. Open source frameworks continue to emerge, and the most popular at present is the Hadoop platform. At its lowest layer, the Hadoop Distributed File System (HDFS) stores huge amounts of data; the MapReduce programming framework provides parallel computing over large data sets; and the HBase column-oriented database stores massive structured data. On this platform, developers do not need to pay much attention to the implementation details of parallelization and can therefore concentrate on the functionality itself.

This paper focuses on Web text analysis based on the Hadoop platform. The work covers network text acquisition, inverted index construction, and text clustering, and is summarized as follows.

First, this paper puts forward a data acquisition scheme for network data based on the Hadoop platform. The system is composed of four modules, which respectively crawl the URLs in web page data, analyze the data, remove duplicate URLs, and extract the useful information from the pages.
Detailed implementation methods are given, including the logic flow chart and the data storage structure of each functional module. Finally, experimental results show that the network data acquisition method based on the Hadoop platform is considerably more efficient than a stand-alone system.

Second, this paper proposes a scheme for building an inverted index on the Hadoop platform. To make Lucene compatible with Hadoop, Lucene's storage function is first extended so that Lucene can read from and write to HDFS. The index construction function, built on the MapReduce framework, is then composed of two modules: one parallelizes Chinese word segmentation, and the other parallelizes inverted index construction. Finally, tests show that the system can build the index in parallel under MapReduce and store it in HDFS as standard-size data blocks.

Third, this paper implements a text clustering algorithm on the Hadoop platform. It gives the detailed implementation steps of the parallel K-Means algorithm, including the logical flow chart and the data storage structure of each functional module. A simulation experiment is also presented, and its results show that the K-Means clustering algorithm based on Hadoop handles massive amounts of text more efficiently than a stand-alone system.
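
The abstract does not include the thesis's actual code, so the following is only a minimal sketch of how inverted index construction and one parallel K-Means iteration can each be expressed as a map step followed by a reduce step. It runs in a single Python process rather than on Hadoop. The names `run_mapreduce`, `index_mapper`, `index_reducer`, `make_kmeans_mapper`, and `kmeans_reducer` are hypothetical, the sample documents and points are made up for illustration, and whitespace splitting stands in for the Chinese word segmenter mentioned above.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle -> reduce in one process (no Hadoop involved)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group emitted values by key
    return {key: reducer(key, values) for key, values in groups.items()}


# --- Inverted index: term -> sorted list of document ids ---
def index_mapper(record):
    doc_id, text = record
    # Assumption: whitespace splitting replaces a real word segmenter.
    for term in set(text.lower().split()):
        yield term, doc_id


def index_reducer(term, doc_ids):
    return sorted(set(doc_ids))


# --- One K-Means iteration: assign to nearest centroid, then average ---
def make_kmeans_mapper(centroids):
    def mapper(point):
        # Squared Euclidean distance from this point to every centroid.
        distances = [
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        ]
        yield distances.index(min(distances)), point
    return mapper


def kmeans_reducer(cluster_id, points):
    # New centroid = coordinate-wise mean of the points in the cluster.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Illustrative data only; not taken from the thesis.
docs = [
    (1, "hadoop stores data"),
    (2, "hadoop indexes text"),
    (3, "text clustering"),
]
index = run_mapreduce(docs, index_mapper, index_reducer)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = run_mapreduce(points, make_kmeans_mapper(centroids), kmeans_reducer)
```

In a real Hadoop job the grouping done inside `run_mapreduce` is performed by the framework's shuffle phase across machines, and the K-Means step would be repeated, feeding `new_centroids` back in, until the centroids stop moving.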

Keywords/Search Tags:

Hadoop, data acquisition, inverted index, parallel K-Means

Related items

1	Research On Key Technologies Of Full-text Index Compression In Cloud Environment
2	Study And Implementation Of Inverted Index On Hadoop
3	Design And Implementation Of Multi-Keyword Parallel Ciphertext Retrieval System Based On Inverted Index
4	Study On Hadoop-based Inverted Index
5	Design And Implementation Of Distributed Index And Search System Based On Cloud Platform
6	Research And Implementation Of Parallel Index For Space Information System
7	Parallel Search On Ciphertext Based On Index In Cloud Computing
8	Data Index Technology Research Based On Parallel Computing Platform
9	Space- And Time-efficient Compression And Intersection Algorithms For Inverted Index
10	Research On Inverted List Parallel Query Method Based On Dataspaces