The Research Of Distributed Text-based Data Filtering Technology And System Implementation Based On MapReduce

Posted on: 2012-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2218330362960376
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet has brought explosive growth in the amount of available information, and obtaining useful, accurate information from this mass of data has become an urgent need. Information filtering technology works from the needs of the user: it removes information that does not meet the user's requirements from a dynamic information flow and, at the same time, finds the useful information efficiently. Because traditional single-machine methods have difficulty meeting these needs at scale, moving to a distributed computing platform is the natural direction.

Content-based text data filtering uses the Vector Space Model to represent text. The relevance of a text is determined by calculating the cosine of the angle between the text vector and the user interest template. This method rests on mature theory, offers high precision, and is easy to understand.

MapReduce is a programming model and framework that allows massive data sets to be processed in parallel on a large cluster of computers. To complete most distributed computing tasks, users only need to supply a map function and a reduce function. Many real-world problems can be expressed easily in the MapReduce model.

In this thesis, we first study the content-based text data filtering model. Because existing text filtering systems cannot always meet our needs, we then focus on the key technologies for text filtering under the MapReduce framework. The main work is as follows.

(1) Studied the relevant theories and technologies of content-based information filtering, and analyzed their advantages and disadvantages in depth.
(2) Analyzed the MapReduce framework and its related components in depth, and explained in detail, with examples, how to develop applications on MapReduce.
(3) Designed a feature extension model based on HowNet. The model reduces the dimension of the text vector by merging terms that have the same meaning.
(4) Proposed an algorithm for calculating TF-IDF on the MapReduce framework, which achieves parallel computation by decomposing the task into sub-tasks.
(5) Designed and implemented a prototype distributed text data filtering system based on the MapReduce framework. Experiments show that the method is feasible.
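
The two core steps named in the abstract, TF-IDF weighting decomposed into map and reduce passes and cosine matching against a user interest template, can be illustrated with a small single-process sketch. This is not the thesis's algorithm, which the abstract does not describe in detail: the function names, the three-pass decomposition, the whitespace tokenizer, the sample documents, and the threshold values are all assumptions made for illustration. The HowNet-based feature merging is omitted, and the map and reduce steps are simulated in plain Python rather than run on a cluster.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map step (hypothetical): emit ((term, doc_id), 1) for every token."""
    for term in text.lower().split():
        yield (term, doc_id), 1


def reduce_sum(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using three map/reduce-style passes."""
    # Pass 1: occurrences of each term in each document.
    counts = reduce_sum(
        kv for doc_id, text in docs.items() for kv in map_term_counts(doc_id, text)
    )
    # Pass 2: tokens per document, and number of documents containing each term.
    doc_len = reduce_sum((doc_id, n) for (_, doc_id), n in counts.items())
    doc_freq = reduce_sum((term, 1) for (term, _) in counts)
    # Pass 3: combine into weight = tf * idf, with idf = ln(N / df).
    n_docs = len(docs)
    vectors = defaultdict(dict)
    for (term, doc_id), n in counts.items():
        tf = n / doc_len[doc_id]
        idf = math.log(n_docs / doc_freq[term])
        vectors[doc_id][term] = tf * idf
    return dict(vectors)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_docs(vectors, profile, threshold):
    """Keep the documents whose similarity to the interest template is high enough."""
    return [d for d, vec in vectors.items() if cosine(vec, profile) >= threshold]


# Illustrative data only; not taken from the thesis.
docs = {
    "d1": "hadoop mapreduce cluster",
    "d2": "football match result",
    "d3": "mapreduce job scheduling",
}
profile = {"mapreduce": 1.0, "hadoop": 1.0}  # hypothetical user interest template

if __name__ == "__main__":
    vectors = tf_idf(docs)
    for doc_id, vec in vectors.items():
        print(doc_id, round(cosine(vec, profile), 3))
    print(filter_docs(vectors, profile, threshold=0.1))
```

In this sketch each pass only needs the key-value pairs produced by the previous one, which is the property that would let the same decomposition be spread across workers. On a real cluster each pass would be a separate job, and the threshold would be tuned against labelled data.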
Keywords/Search Tags: Text filtering, MapReduce, Distributed Computing, Vector Space Model, Feature extension