
Research And Implementation Of Retrieval System On Massive Mail

Posted on: 2009-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Shi
Full Text: PDF
GTID: 2178360278464410
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computers and networks, e-mail has become one of the most important Internet applications, valued for its convenience and speed. Individuals, enterprises, governments, and even the military rely on e-mail for daily life and work. However, illegal businesses and criminals also use e-mail to push advertisements, viruses, and unhealthy content, as well as information that undermines national stability, which creates potential security hazards for individuals, enterprises, and the nation. Mail filtering is a mature technology for blocking spam, but it cannot prevent the propagation of such harmful information. How to retrieve sensitive information from massive document collections and trace suspicious information and users has therefore become a research direction, and there is an urgent need to manage and monitor massive mail securely.

This paper analyzes the characteristics and special format of mail and presents a retrieval system for massive mail. The system lets users easily search the fields they are interested in, such as the mail content, the sender (From), and the recipient (To), so it can support mail monitoring effectively. To improve the efficiency of processing massive mail, the work focuses on distributed mail parsing, indexing, and searching. First, after introducing the theory of the mail document, the paper analyzes the mail format and proposes a VSM (Vector Space Model) for mail documents. Second, a traditional inverted index is used to store the indices. Unlike ordinary retrieval systems, this system implements incremental indexing, which greatly reduces index update time. To speed up the processing of massive mail data, distributed processing is adopted in the system architecture. During mail pre-processing, a distributed algorithm runs a single task across several nodes, which speeds up parsing and indexing and also makes the search process stable and fast. Finally, the paper describes the experiments, comparing the parsing and indexing speed of the monolithic and distributed systems. It also concludes that search time depends on the mail scale and the complexity of the query.

The system implements the full series of user operations, including mail parsing, indexing, and searching, combined with distributed parallel technology. It uses the inverted index to store and manage the mail indices, and it computes similarity with the mail VSM to meet the demands of user queries. In addition, the system's unified interface and methods provide good computing capability and a good application development environment.
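
The abstract describes a field-aware mail index (content, From, To), an inverted index that is updated incrementally, and similarity ranking with a vector space model. The following is a minimal single-node sketch of how those three pieces can fit together. It is not the thesis's implementation: the class name `MailIndex`, the helper functions, the tokenizer, and the choice of TF-IDF weights with cosine similarity are all assumptions made for illustration, and the distributed parsing and indexing described above are not covered.

```python
import math
import re
from collections import Counter, defaultdict
from email import message_from_string


def parse_mail(raw):
    """Split a raw RFC 822-style message into the fields to be indexed."""
    msg = message_from_string(raw)
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # multipart is ignored here
    return {
        "from": msg.get("From", ""),
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "body": body,
    }


def tokenize(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class MailIndex:
    """Illustrative inverted index keyed by (field, term)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # (field, term) -> {doc_id: tf}
        self.doc_terms = {}                # (doc_id, field) -> Counter of terms
        self.doc_count = 0

    def add(self, raw):
        """Incremental update: only the new message's postings are written."""
        doc_id = self.doc_count
        self.doc_count += 1
        for field, text in parse_mail(raw).items():
            counts = Counter(tokenize(text))
            self.doc_terms[(doc_id, field)] = counts
            for term, tf in counts.items():
                self.postings[(field, term)][doc_id] = tf
        return doc_id

    def _idf(self, field, term):
        # Smoothed inverse document frequency (an assumed weighting scheme).
        df = len(self.postings.get((field, term), ()))
        return math.log((1 + self.doc_count) / (1 + df)) + 1.0

    def search(self, field, query):
        """Rank documents by cosine similarity of TF-IDF vectors in one field."""
        q_vec = {t: tf * self._idf(field, t)
                 for t, tf in Counter(tokenize(query)).items()}
        q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
        dots = defaultdict(float)
        for term, q_w in q_vec.items():
            idf = self._idf(field, term)
            for doc_id, tf in self.postings.get((field, term), {}).items():
                dots[doc_id] += q_w * tf * idf
        results = []
        for doc_id, dot in dots.items():
            d_norm = math.sqrt(sum((tf * self._idf(field, t)) ** 2
                                   for t, tf in self.doc_terms[(doc_id, field)].items()))
            results.append((doc_id, dot / (q_norm * d_norm)))
        # Highest similarity first; ties broken by document id.
        return sorted(results, key=lambda r: (-r[1], r[0]))
```

A call such as `index.search("from", "alice")` restricts matching to the sender field, while `index.search("body", "invoice")` searches the message text. Because `add` writes only the postings of the new message, the index can grow without being rebuilt; the cost in this sketch is that document norms are recomputed at query time, since the IDF values change as mail arrives.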
Keywords/Search Tags: mass-mail, distributed processing, information retrieval, index