Design of a Real-Time Big Data Processing System Based on an Improved TF-IDF Computation Model

Posted on: 2018-12-12  Degree: Master  Type: Thesis
Country: China  Candidate: H M Wang  Full Text: PDF
GTID: 2348330542971926  Subject: Computer technology
Abstract/Summary:
Real-time data processing systems are mainly used to process massive volumes of data; their distributed nature provides high processing capacity and availability. Evaluating how important a word is to a document within a collection of documents is a key problem in text mining on big data processing systems. Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique in text retrieval and text mining, and it is widely used in search engine systems. At present, the open-source community has no mature solution for computing TF-IDF in real time on a big data processing system; existing solutions are mostly held by commercial companies, and the technology is not open. This thesis first introduces the technologies related to big data systems and their development status domestically and internationally, and analyzes the big data analysis and processing flow and the TF-IDF algorithm in detail. Then, a low-latency TF-IDF algorithm is designed and implemented on the stream computing platform JStorm, and a batch TF-IDF algorithm is designed and implemented on the batch processing framework Spark. An algorithm for merging the batch view and the real-time view is proposed, and a fused batch and real-time computing architecture is built, which improves both accuracy and real-time performance. The research results have been applied in practice and address the problem of implementing the TF-IDF algorithm on a real-time big data processing system. Latency and accuracy have largely reached the expected levels, meeting the needs of application scenarios that use TF-IDF in massive-data production environments, such as search engines, text similarity calculation, sentiment analysis, text summarization, and trending-word computation.
Keywords/Search Tags:TF-IDF, Real-Time System, Spark, JStorm, big data, low-latency
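
The abstract does not give the thesis's improved formula or its view-merging algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: it uses the textbook TF-IDF weight with a smoothed IDF, and it merges a batch view and a real-time view simply by summing their document-frequency counts. The function names (`merge_views`, `tf_idf`) and the sample counts are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def merge_views(batch_df, batch_docs, realtime_df, realtime_docs):
    """Combine two document-frequency views by summing their counts.

    Assumption: the batch view covers older documents and the real-time
    view covers documents that arrived since the last batch run, so the
    two views do not overlap.
    """
    merged = Counter(batch_df)
    merged.update(realtime_df)  # adds counts term by term
    return merged, batch_docs + realtime_docs


def tf_idf(tokens, doc_freq, total_docs):
    """Return {term: tf * idf} for one tokenized document.

    tf  = occurrences of the term / number of tokens in the document
    idf = log((1 + N) / (1 + df)), a smoothed variant that avoids
          division by zero for terms not yet seen in any view.
    """
    counts = Counter(tokens)
    n = len(tokens)
    weights = {}
    for term, count in counts.items():
        tf = count / n
        idf = math.log((1 + total_docs) / (1 + doc_freq.get(term, 0)))
        weights[term] = tf * idf
    return weights


# Hypothetical views: the batch layer has seen 4 documents,
# the real-time layer has seen 1 more.
batch_view = {"spark": 2, "data": 4}
realtime_view = {"spark": 1, "jstorm": 1}

doc_freq, total_docs = merge_views(batch_view, 4, realtime_view, 1)
weights = tf_idf(["spark", "data", "data", "jstorm"], doc_freq, total_docs)

for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.4f}")
```

In this sketch the term that appears in the fewest documents ("jstorm") receives the highest weight even though "data" occurs more often in the document, which is the behavior TF-IDF is meant to provide. A production system would also need to handle overlap between the two views, which this example does not attempt.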