Approximate Lineage Extraction Techniques For Emails

Posted on: 2012-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: J F Yu
Full Text: PDF
GTID: 2248330395958259
Subject: Computer system architecture
Abstract/Summary:
With the progress of society and science, the amount of data is growing exponentially. When confronted with massive data, people want to know where the data comes from and how it was derived. Data lineage describes how data is generated and the whole process by which it propagates. It is widely used in scientific computing, bioengineering, databases, and other areas. However, existing data lineage extraction technologies are not suitable for approximate data, which limits the application of data lineage techniques.
The problem studied in this thesis is approximate data lineage extraction for emails. Email is a product of the information age and is widely used in daily life. When an email user has a large number of messages that cannot be clearly classified, a convenient and efficient method is needed to manage and query them. Data lineage for emails can meet this requirement: it can not only group emails on the same topic, but also describe the derivation relations between these emails. However, to the best of our knowledge, there is no mature lineage extraction technique that solves this problem. In this thesis, we analyze the characteristics of email and propose techniques that effectively support approximate email lineage extraction.
We first define the concepts of email-lineage relation, approximate email, and approximate email-lineage relation. We analyze the characteristics of EML message data and put forward an email information extraction method that can efficiently extract and decode email data. We then propose an email lineage-relationship extraction algorithm based on the concept of email-lineage relation and the information extracted from the emails. The algorithm can effectively extract potential derivation relations between emails. We also propose an index structure and query optimization to further improve the algorithms.
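The extraction steps described above can be illustrated with a minimal sketch. The abstract does not say which EML fields the thesis relies on, so the use of the Message-ID and In-Reply-To headers here is an assumption, and the function names extract_info and extract_lineage are hypothetical. The sketch uses only Python's standard email package to decode a message and then links each reply to its parent.

```python
from email import message_from_string, policy


def extract_info(raw_eml):
    """Parse one EML message and return decoded fields (field choice is an assumption)."""
    # policy.default decodes RFC 2047 encoded words in headers on access.
    msg = message_from_string(raw_eml, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "id": str(msg["Message-ID"] or "").strip(),
        "parent": str(msg["In-Reply-To"] or "").strip(),
        "subject": str(msg["Subject"] or ""),
        "body": body.get_content() if body is not None else "",
    }


def extract_lineage(messages):
    """Return (parent_id, child_id) pairs for replies whose parent is in the set."""
    known = {m["id"] for m in messages if m["id"]}
    return [
        (m["parent"], m["id"])
        for m in messages
        if m["parent"] and m["parent"] in known
    ]
```

Header links of this kind only recover exact derivation relations; a message whose reply headers are missing or altered would need the approximate techniques the thesis goes on to discuss.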
After that, we analyze the approximate lineage relationships between emails that may arise in practical applications, including approximation in query content and approximation in topic. We propose an approximate email lineage extraction algorithm based on a q-gram index and clustering methods. Finally, we use a real dataset called enron-email and 500 randomly selected personal emails to test the performance of the algorithms in this thesis. The experimental results show that the proposed algorithms can efficiently support approximate email lineage extraction.
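As an illustration of the q-gram idea mentioned above, the following sketch builds an inverted q-gram index over email subjects and retrieves approximately matching ones. It is not the thesis's algorithm: the abstract gives no details, so the Jaccard similarity measure, the default q of 3, the threshold of 0.5, and all function names are assumptions, and the clustering step is not covered.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of overlapping q-character substrings of the normalized text."""
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_index(subjects, q=3):
    """Map each q-gram to the ids of the subjects that contain it."""
    index = defaultdict(set)
    for sid, text in subjects.items():
        for gram in qgrams(text, q):
            index[gram].add(sid)
    return index


def similar(query, subjects, index, q=3, threshold=0.5):
    """Return ids whose q-gram Jaccard similarity with the query meets the threshold."""
    query_grams = qgrams(query, q)
    # The index narrows the search to subjects sharing at least one q-gram.
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    result = []
    for sid in sorted(candidates):
        grams = qgrams(subjects[sid], q)
        if len(query_grams & grams) / len(query_grams | grams) >= threshold:
            result.append(sid)
    return result
```

With subjects such as "Project budget review" and the misspelled "Project budjet review", a query for the first would also retrieve the second, since a single changed character affects only three of the trigrams.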
Keywords/Search Tags: data lineage, email, extraction, approximate, clustering