
Research on and Improvement of the Algorithm for Mining Frequent Item Sets in Text Association Analysis

Posted on: 2009-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: F Hao
GTID: 2178360245465690
Subject: Computer software and theory
Abstract/Summary:
In the information era we are confronted with enormous amounts of data. How to help people collect and select the information they are interested in, and discover latent, useful knowledge in an ever-growing volume of information, has become a hot issue in the field of information technology. Data mining and knowledge discovery in databases emerged in response to this need. Text association analysis, which finds connections between different words in a document collection, is an important task in text mining. Most methods apply the association rule techniques of conventional data mining.

First, this thesis studies the characteristics of keyword-based text association analysis, which closely resembles conventional association analysis. If each text is regarded as a transaction and each keyword as an item, keyword association analysis on a text database is transformed into association analysis on an ordinary database. However, because text data is high-dimensional and sparse, applying the same minimum support threshold to different text databases yields widely differing numbers of frequent item sets. Setting the support threshold is therefore a difficulty in text association analysis.

Second, this thesis studies IntvMatrix, an algorithm for mining the N most frequent item sets. The algorithm uses a strategy of dynamically adjusting the support threshold, so the number of frequent item sets can be controlled through the input number N. Its defects are that constructing the inverse matrix wastes space, and that building the associations between items requires scanning the database many times, which wastes time.

Third, to address the problems of IntvMatrix, this thesis proposes an algorithm for mining the N most frequent item sets based on an improved FP-Tree. It sorts the items and the whole database and at the same time deletes the non-frequent items, which reduces the time needed to search for shared prefixes. It then constructs the COFI-Tree of local frequent items on the basis of the FP-Tree. The algorithm still uses the strategy of dynamically adjusting the support threshold, which guarantees that the N most frequent item sets are produced.

Finally, different values of N are input on the same text database, and the new algorithm is compared with IntvMatrix. The results show that the new algorithm improves both time and space utilization, because it uses the improved FP-Tree to construct local COFI-Trees together with an optimized data structure.
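The idea of mining the N most frequent item sets with a dynamically adjusted support threshold can be illustrated with a minimal Python sketch. This is not IntvMatrix and not the improved FP-Tree/COFI-Tree algorithm of the thesis: it is a brute-force, level-wise enumeration written only to show how the threshold can rise as better candidates are found. The function name `top_n_frequent_itemsets`, the `max_size` limit, and the sample documents are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import combinations


def top_n_frequent_itemsets(documents, n, max_size=3):
    """Return up to n (itemset, support) pairs with the highest support.

    Each document is treated as a transaction and each keyword as an item.
    Illustrative sketch only: candidates are enumerated by brute force.
    """
    transactions = [frozenset(doc) for doc in documents]
    item_counts = Counter(item for t in transactions for item in t)

    min_support = 1   # dynamic threshold, raised once n item sets are held
    heap = []         # min-heap of (support, itemset), at most n entries

    for size in range(1, max_size + 1):
        counts = Counter()
        for t in transactions:
            # Drop items below the current threshold: no item set that
            # contains them can reach it. Sorting gives canonical tuples.
            kept = sorted(i for i in t if item_counts[i] >= min_support)
            for combo in combinations(kept, size):
                counts[combo] += 1

        for itemset, support in counts.items():
            if support < min_support:
                continue
            if len(heap) < n:
                heapq.heappush(heap, (support, itemset))
            elif (support, itemset) > heap[0]:
                heapq.heapreplace(heap, (support, itemset))
            if len(heap) == n:
                # The weakest item set kept so far becomes the new threshold.
                min_support = max(min_support, heap[0][0])

    ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
    return [(itemset, support) for support, itemset in ranked]


# Hypothetical keyword lists, one per document.
docs = [
    ["data", "mining", "text"],
    ["data", "mining"],
    ["data", "text"],
    ["mining", "text", "tree"],
    ["data", "mining", "tree"],
]

for itemset, support in top_n_frequent_itemsets(docs, 4):
    print(itemset, support)
```

The caller supplies only N. The threshold starts at 1 and is raised to the support of the weakest item set currently kept, so no database-specific minimum support has to be chosen in advance. Item sets whose support ties at the cut-off are kept or dropped by tuple order in this sketch, which is an arbitrary choice.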
Keywords/Search Tags:data mining, text data mining, text association analysis, N most frequent item sets, FP-Tree, COFI-Tree