
Study On Management Of Text Documents Based Content In Dataspace

Posted on: 2011-12-24  Degree: Master  Type: Thesis
Country: China  Candidate: D B Liu  Full Text: PDF
GTID: 2248330395458438  Subject: Computer application technology
Abstract/Summary:
With the development of information technology and the growing popularity of computers, personal data is expanding rapidly and the Web is becoming a huge information-sharing platform. As a result, data management presents several new features: rapid growth, information sharing, diversification of resources, and heterogeneous distribution. Dataspace is a new abstraction for information management that aims at the challenges facing traditional database technology. Documents, which contain much semi-structured and unstructured information, are among the most frequently used objects. If the inner information of documents is treated as a data resource that helps users query and browse text information, the functionality of a dataspace is enhanced. However, existing dataspace management systems generally neglect this rich inner information. In this thesis, we therefore introduce two content-based clustering algorithms to manage text information and organize documents. First, document wrappers extract the inner information, which is divided into schema information and feature information. Clustering algorithms then cluster the documents using this inner information. For clustering based on schema information, an algorithm called term frequency matrix is introduced to select schema terms. After the schema terms are represented as vectors, the documents are clustered by the SOM algorithm, which has been optimized to reduce training times. For clustering based on feature information, we introduce an algorithm called FTTC, which is inspired by the FP-growth algorithm. First, we construct a tree according to the importance of frequent terms. Second, we traverse the tree and visit each node whose document number is greater than the minimum support. During each visit, two operations, merging and moving-up, are executed; these put documents into clusters that are represented by frequent terms.
Dataspace users can thus browse documents and query inner information through the associations among document clusters. The experiments verify the algorithms mainly in terms of precision, recall, and F-measure, and examine the influence of parameters.
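The idea of grouping documents under frequent terms that meet a minimum support can be illustrated with a small sketch. This is not the FTTC algorithm itself: the abstract does not specify the tree construction or the merging and moving-up operations, so they are omitted here. The function names, the use of document frequency as the "importance" ranking, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def frequent_terms(docs, min_support):
    """Return terms that occur in at least min_support documents.

    Terms are ranked by document frequency (an assumed stand-in for the
    "importance" mentioned in the abstract), ties broken alphabetically.
    """
    doc_freq = Counter(term for doc in docs for term in set(doc))
    frequent = [t for t, count in doc_freq.items() if count >= min_support]
    return sorted(frequent, key=lambda t: (-doc_freq[t], t))


def cluster_by_frequent_term(docs, min_support):
    """Assign each document to the cluster of its highest-ranked frequent term.

    Documents that contain no frequent term are collected under the key None.
    """
    ranked = frequent_terms(docs, min_support)
    clusters = {}
    for doc_id, doc in enumerate(docs):
        terms = set(doc)
        label = next((t for t in ranked if t in terms), None)
        clusters.setdefault(label, []).append(doc_id)
    return clusters


# Hypothetical tokenized documents, for illustration only.
docs = [
    ["dataspace", "query", "index"],
    ["dataspace", "clustering"],
    ["clustering", "som"],
    ["image"],
]
clusters = cluster_by_frequent_term(docs, min_support=2)
```

In this simplified version each cluster is labeled by a single frequent term, whereas the abstract describes clusters represented by several frequent terms.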
Keywords/Search Tags: dataspace, data mining, text clustering, frequent itemsets, text document