Text Data Mining Based On Association Rules

Posted on: 2007-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q X Zou
Full Text: PDF
GTID: 2208360185972974
Subject: Computer application technology
Abstract/Summary:
In the information era, the volume of data is enormous and growing fast, a situation that has been described as "data rich but information poor." Data Mining and Knowledge Discovery in Databases have been applied to address this situation, have developed rapidly, and are now recognized as a new area of database research.

Data mining (also known as Knowledge Discovery in Databases, KDD) has been defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It uses machine learning, statistical, and visualization techniques to discover knowledge and present it in a form that is easily comprehensible to humans.

Text mining, which is closely related to natural language, is the discovery by computer of new, previously unknown information by automatically extracting information from different written resources. A key element is linking the extracted information together to form new knowledge.

Text mining is different from web search. In search, the user is typically looking for something that is already known, whereas text mining aims to find knowledge that is not yet known. The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts.

An association rule describes a correlation among items in a large database. Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. An association rule usually describes the relationship among a set of items, and rules are divided into Boolean association rules and quantitative (value-based) association rules. Association rules provide this information in the form of "if-then" statements (if X, then Y). In addition to the antecedent (the X part) and the consequent (the Y part), an association rule has two numbers that express the degree of uncertainty about the rule.
In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (they have no items in common). The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in both the antecedent and the consequent to the number of transactions that include all items in the antecedent.

The most famous association rule algorithm is the Apriori algorithm, proposed by R. Agrawal. Apriori is used to find single-dimensional, single-level, Boolean association rules. It is based on the observation that if a k-itemset is frequent, then each of its (k-1)-item subsets is also frequent. The problem of discovering association rules can be divided into two steps:

1. Find all frequent itemsets (sets of items appearing together in a transaction) whose support is greater than the specified threshold.
2. Generate association rules from the frequent itemsets. To do this, consider all partitions of the itemset into rule left-hand...
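
The support and confidence definitions and the first step (finding frequent itemsets) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `support_count`, `confidence`, and `apriori`, the sample documents in `docs`, and the threshold value are all assumptions made for illustration. Following the wording above, support is a raw transaction count and an itemset is treated as frequent only when its support is strictly greater than the threshold.

```python
from itertools import combinations


def support_count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset.issubset(t))


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support_count(transactions, frozenset(antecedent) | frozenset(consequent))
    ante = support_count(transactions, antecedent)
    return both / ante if ante else 0.0


def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets with support > min_support.

    Level-wise search: candidates of size k are built only from frequent
    itemsets of size k-1, and a candidate is discarded if any of its
    (k-1)-item subsets is not frequent.
    """
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: support_count(transactions, c) for c in current}
        level = {c: n for c, n in counts.items() if n > min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical data: each "transaction" is the set of terms in one document.
docs = [
    {"data", "mining", "text"},
    {"data", "mining"},
    {"data", "text"},
    {"mining", "text", "rule"},
    {"data", "mining", "rule"},
]

pair_support = support_count(docs, {"data", "mining"})   # 3 documents
rule_confidence = confidence(docs, {"data"}, {"mining"})  # 3 / 4
frequent_itemsets = apriori(docs, min_support=2)
```

With these sample documents and a threshold of 2, only the three single terms "data", "mining", and "text" and the pair {"data", "mining"} are frequent. Treating each document as a transaction of terms is one simple way to apply association rules to text; other representations are possible.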
Keywords/Search Tags: Data Mining, Association Rule, Text Mining, Association Rule Algorithm