Research On Forum Information Extraction And Storage Based On Cloud-Based MongoDB

Posted on: 2013-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: M M Chen
Full Text: PDF
GTID: 2248330392952043
Subject: Computer technology
Abstract/Summary:
The rapid development of Internet technology and the spread of input terminals such as mobile phones, tablet PCs, and smart TVs have caused Internet data to grow explosively. Storing this massive data stably and quickly, and mining valuable information from it, has become a new challenge for many companies. The emergence of cloud storage brings new opportunities for data mining technology. Giants such as Amazon, Microsoft, Google, and IBM have launched their own cloud storage platforms, and at the same time large Chinese companies such as Baidu, Huawei, Tencent, and 360 have followed. This thesis takes massive forum data as its sample and builds an experimental system that supports horizontal scaling. It then designs and implements several methods for extracting forum data, and finally verifies some of the performance advantages brought by cloud storage. The main contents of the thesis are as follows:
1) The thesis introduces in detail the NoSQL databases whose development was driven by cloud storage, and describes the characteristics of the different kinds of NoSQL database. Based on the features of forums, it chooses MongoDB, one of the NoSQL databases, as the storage database. After comparing MongoDB with traditional databases, it summarizes the advantages of MongoDB and explains how to use MongoDB to store forum data.
2) The thesis describes methods for extracting forum information. After analyzing the features and structures of domestic forums, it divides forums into two categories: common forums and specialized forums. For common forums, it uses regular expressions to extract data accurately; for specialized forums, it designs a heuristic method for data extraction. Using a different extraction method for each category improves extraction accuracy.
3) To validate the new storage approach and the feasibility of the information extraction algorithms, the thesis designs an experimental information extraction system based on MongoDB's distributed storage. The system supports horizontal scaling, stores massive data stably, and accurately mines many kinds of useful data from forums. Once the data has grown to a certain size, the performance of large-scale data storage is tested by comparing several query cases. The experiment concludes that, when dealing with large data, distributed cloud storage has an overwhelming advantage over MongoDB running on a single-server architecture.
4) The thesis closes by summarizing the whole work and discussing the remaining problems and future work.
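As an illustration of the regular-expression approach described for common forums, the following is a minimal sketch, not the thesis's actual implementation. The HTML layout, the class names, the field names (post_id, author, posted_at, body, thread_id), and the helper extract_posts are all assumptions made up for this example; real forum templates differ. It shows how matched fields can be collected into document-shaped records of the kind a document store such as MongoDB accepts.

```python
import re

# Hypothetical markup for a "common" forum page (an assumption for this
# sketch; the abstract does not describe any concrete template).
SAMPLE_HTML = """
<div class="post" id="p101"><span class="author">alice</span>
<span class="time">2012-11-03 09:15</span>
<div class="body">How do I shard a collection?</div></div>
<div class="post" id="p102"><span class="author">bob</span>
<span class="time">2012-11-03 09:42</span>
<div class="body">Pick a shard key first.</div></div>
"""

# One pattern per post; named groups become the keys of the stored document.
POST_RE = re.compile(
    r'<div class="post" id="p(?P<post_id>\d+)">\s*'
    r'<span class="author">(?P<author>[^<]+)</span>\s*'
    r'<span class="time">(?P<posted_at>[^<]+)</span>\s*'
    r'<div class="body">(?P<body>.*?)</div>\s*</div>',
    re.DOTALL,
)


def extract_posts(html, thread_id):
    """Return one dict per post found in html, tagged with thread_id."""
    docs = []
    for match in POST_RE.finditer(html):
        doc = match.groupdict()
        doc["post_id"] = int(doc["post_id"])  # store the id as a number
        doc["body"] = doc["body"].strip()
        doc["thread_id"] = thread_id
        docs.append(doc)
    return docs


posts = extract_posts(SAMPLE_HTML, thread_id=7)
for post in posts:
    print(post)
```

Each resulting dict could then be written to a MongoDB collection, for example with pymongo's Collection.insert_many; that step is left out here so the sketch runs without a database.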
Keywords/Search Tags: Cloud computing, NoSQL, MongoDB, Search Engine, Information Extraction