| Email data contains a large amount of latent information: the email communication network can be mapped to a social communication network, so email exchanges indirectly reflect the relationships between people, and the content of a user's email reflects the interests and topics that concern that mailbox user. Using the information contained in emails to discover important nodes and detect communities in the email communication network is a closely watched problem in the field of email data mining. This thesis studies network construction, topic detection, important node discovery, and community detection. The main work and innovations are as follows: 1. Email Communication Network Attribute Description. After analyzing the email data, we construct the email communication network by extracting the communication between communicating entities, and we describe the relevant attributes of email communication. We propose an improved K-means algorithm that detects topics automatically by clustering emails, after building an improved email VSM (vector space model) that labels each email by combining its body and subject. The improved K-means algorithm is then used to obtain the topic attribute of email nodes. Its F-measure increases by 16.2% compared with standard K-means. 2. Important Node Discovery. To account for the particular characteristics of email data, this thesis proposes two new importance measures: an advanced clustering coefficient and an election-based EmailRank. Because degree alone is a one-sided measure, we adopt a synthesized evaluation method to discover important nodes. Experiments on the Enron dataset show that the synthesized evaluation discovers important nodes better than single-measure evaluation and better than the method based on graph entropy. 3. Subnet Extraction Taking Important Nodes as Kernels. 
We propose four subnet extraction algorithms based on network structure. Analysis and experiments show that the method based on edge weight yields a more tightly connected subnet, because it takes into account both the closeness of communication between email nodes and the hierarchical level. 4. Subnet Community Detection Based on Email Content. We propose an algorithm that detects communities by clustering the edge content of a structurally close subnet. We validate the algorithm on a manually labeled dataset and also run experiments on the Enron dataset, obtaining communities that are close in both structure and content. 5. Construction of an Email Communication Network Mining Prototype System. We design and implement modules for important node discovery, subnet extraction, and community detection. Finally, we conclude the thesis and outline directions for further research on the analysis and mining of email communication networks. |
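
As a rough illustration of items 1 and 3, the sketch below builds a weighted communication network from sender/recipient pairs and then grows a subnet outward from one kernel node along sufficiently heavy edges. It is not the thesis's algorithm: the abstract does not specify the extraction rules, so the function names `build_network` and `extract_subnet`, the `min_weight` and `max_depth` parameters, the undirected traversal, and the toy messages are all assumptions made for this example.

```python
from collections import Counter, deque


def build_network(messages):
    """Count messages per directed (sender, recipient) pair."""
    weights = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:  # ignore self-addressed mail
                weights[(sender, recipient)] += 1
    return weights


def extract_subnet(weights, kernel, min_weight=2, max_depth=2):
    """Breadth-first expansion from `kernel` over edges whose weight is
    at least `min_weight`, treated as undirected (an assumption).
    Returns a dict mapping each reached node to its hop distance."""
    neighbours = {}
    for (a, b), w in weights.items():
        if w >= min_weight:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    depth = {kernel: 0}
    queue = deque([kernel])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the allowed level
        for nxt in neighbours.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


# Hypothetical toy data: (sender, [recipients])
messages = [
    ("alice", ["bob"]),
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("bob", ["dave"]),
    ("bob", ["dave"]),
    ("carol", ["erin"]),
]
weights = build_network(messages)
subnet = extract_subnet(weights, "alice")
print(subnet)
```

Here the edge weight stands in for closeness of communication and the hop distance stands in for the hierarchical level; a real implementation would likely combine them differently and would choose the kernel from the important-node ranking rather than by hand.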