
The Research And Implementation Of Technologies For Analyzing Internet Text Based On Hadoop

Posted on: 2015-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: T Zhou
Full Text: PDF
GTID: 2268330428967779
Subject: Computer application technology
Abstract/Summary:
With the rapid development of mobile devices and the Internet, the amount of information online is growing exponentially. The two key problems in dealing with Big Data are the storage and the computation of huge amounts of data, and traditional text processing systems cannot meet the needs of large-scale network text analysis in either respect. How to acquire network information efficiently and in real time, and how to store and process massive data, are common concerns in academia and industry at present, so the study of these problems is of great significance.

Faced with these storage and computing problems, cloud computing and big data processing technologies have moved from concept to application, which provides a new direction for network text analysis. Open source frameworks continue to appear, and the most popular at present is the Hadoop platform. Its lowest layer uses the Hadoop Distributed File System (HDFS) to store huge amounts of data, the MapReduce programming framework provides parallel computing over large data sets, and the HBase column-oriented database stores massive structured data. On this platform, developers do not need to pay much attention to the implementation details of parallelization and can concentrate on the functionality itself.

This paper focuses on Web text analysis based on the Hadoop platform. The work covers network text acquisition, inverted index construction, and text clustering, and is summarized as follows.

First, this paper puts forward a data acquisition scheme for network data based on the Hadoop platform. The system is composed of four modules, which respectively crawl the URLs in web page data, analyze the data, remove duplicate URLs, and extract the useful information from the data pages.
Detailed implementation methods are given in this paper, including the logic flow chart and the data storage structure of each functional module. Finally, the experimental results show that the network data acquisition method based on the Hadoop platform is much more efficient than a stand-alone system.

Second, this paper proposes a scheme for building an inverted index on the Hadoop platform. To make Lucene compatible with Hadoop, the Lucene storage function is first extended so that Lucene can read from and write to the HDFS file system. The indexing function built on the MapReduce framework then consists of two modules: one parallelizes Chinese word segmentation, and the other parallelizes inverted index construction. Testing shows that the system can build the index in parallel under MapReduce and store it in HDFS as standard-size data blocks.

Third, this paper implements a text clustering algorithm on the Hadoop platform. It gives the detailed implementation steps of the parallel K-Means algorithm, including the logic flow chart and the data storage structure of each functional module. A simulation experiment is also presented, and the experimental results show that the K-Means clustering algorithm based on Hadoop deals with massive amounts of text more efficiently than a stand-alone system.
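
The abstract does not show the Hadoop or Lucene code itself. As a rough illustration of how the inverted index and parallel K-Means steps map onto the MapReduce model, here is a minimal single-process sketch in Python. It is not the author's implementation: run_mapreduce and every mapper and reducer name are hypothetical, whitespace splitting stands in for the Chinese word segmenter, and plain numeric tuples stand in for text feature vectors.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle (group by key) -> reduce in one process.

    Hypothetical helper: on Hadoop the framework performs this grouping
    across machines; here it is a dictionary of lists.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# --- Job 1: inverted index (term -> sorted list of document ids) ---
def index_mapper(record):
    doc_id, text = record
    # Assumption: whitespace splitting replaces real word segmentation.
    for term in set(text.lower().split()):
        yield term, doc_id


def index_reducer(term, doc_ids):
    return sorted(set(doc_ids))


# --- Job 2: one K-Means iteration (cluster id -> new centroid) ---
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_kmeans_mapper(centroids):
    def mapper(point):
        # Assign the point to its nearest current centroid.
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(point, centroids[i]))
        yield nearest, point
    return mapper


def kmeans_reducer(cluster_id, points):
    # The new centroid is the coordinate-wise mean of the assigned points.
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def kmeans(points, centroids, max_iterations=20):
    """Repeat the map/reduce step until the centroids stop changing."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        means = run_mapreduce(points, make_kmeans_mapper(centroids),
                              kmeans_reducer)
        # A cluster that received no points keeps its previous centroid.
        updated = [means.get(i, centroids[i]) for i in range(len(centroids))]
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In both jobs the mapper handles one record at a time and the reducer only sees the values grouped under one key, which is what lets the framework spread the work across nodes. Each K-Means iteration is a separate job whose output centroids become the input of the next.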
Keywords/Search Tags: Hadoop, data acquisition, inverted index, parallel K-Means