Study On Content-Based Management Of Text Documents In Dataspace

Posted on:2011-12-24

Degree:Master

Type:Thesis

Country:China

Candidate:D B Liu

Full Text:PDF

GTID:2248330395458438

Subject:Computer application technology

Abstract/Summary:

With the development of information technology and the spread of computers, personal data is expanding rapidly and the Web is becoming a huge information-sharing platform. As a result, data management presents some new features: rapid growth, information sharing, diversification of resources, and heterogeneous distribution. Dataspace is a new abstraction for information management that targets the challenges facing traditional database technology. Documents, which contain much semi-structured and unstructured information, are among the most frequently used objects. If the inner information of documents is treated as a data resource that helps users query and browse text information, dataspace functionality will be enhanced. However, existing dataspace management systems generally neglect this rich inner information. Thus, in this paper, we introduce two content-based clustering algorithms to manage text information and organize documents.

First, document wrappers extract inner information, which is divided into schema information and feature information. Clustering algorithms then cluster the documents using this inner information. In the clustering based on schema information, a method called the term frequency matrix is introduced to select schema terms. After the schema terms are represented as vectors, documents are clustered by the SOM algorithm, which has been optimized so that it needs less training. In the clustering based on feature information, we introduce an algorithm called FTTC that is inspired by the FP-growth algorithm. First, we construct a tree according to the importance of frequent terms. Second, we traverse the tree and visit a node when its document count is greater than the minimum support. During each visit, two operations, merging and moving-up, are executed; they put documents into clusters that are represented by frequent terms.

Dataspace users can thus browse documents and query inner information through the associations among document clusters. The experiments evaluate our algorithms in terms of precision, recall, and F-measure, and examine the influence of parameters.
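The tree-building and traversal steps described for the feature-information clustering can be illustrated with a minimal Python sketch. This is not the FTTC algorithm itself: the abstract does not define term "importance" or the merging and moving-up operations, so the sketch assumes importance means document frequency, treats a term as frequent when it occurs in at least `min_support` documents, and only lists the tree nodes that hold enough documents. The names `Node`, `build_term_tree`, and `candidate_clusters` are hypothetical.

```python
from collections import Counter


class Node:
    """Prefix-tree node: a term plus the IDs of documents whose path passes through it."""

    def __init__(self, term):
        self.term = term
        self.docs = set()
        self.children = {}


def build_term_tree(docs, min_support):
    """Build an FP-tree-like prefix tree; docs maps a document ID to its set of terms."""
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for terms in docs.values() for term in set(terms))
    # Assumption: a term is frequent if it occurs in at least min_support documents.
    frequent = {term for term, count in df.items() if count >= min_support}
    root = Node(None)
    for doc_id, terms in docs.items():
        # Assumption: "importance" is document frequency; ties break alphabetically.
        path = sorted(frequent & set(terms), key=lambda t: (-df[t], t))
        node = root
        for term in path:
            node = node.children.setdefault(term, Node(term))
            node.docs.add(doc_id)
    return root


def candidate_clusters(node, min_support, prefix=()):
    """Depth-first traversal returning (term path, document IDs) for well-supported nodes."""
    found = []
    for term in sorted(node.children):
        child = node.children[term]
        # Document sets only shrink further down, so unsupported subtrees are skipped.
        if len(child.docs) >= min_support:
            label = prefix + (term,)
            found.append((label, sorted(child.docs)))
            found.extend(candidate_clusters(child, min_support, label))
    return found


# Made-up example documents, for illustration only.
example_docs = {
    "d1": {"query", "index", "dataspace"},
    "d2": {"query", "index"},
    "d3": {"query", "cluster"},
    "d4": {"cluster", "tree"},
}
tree = build_term_tree(example_docs, min_support=2)
for label, members in candidate_clusters(tree, min_support=2):
    print(label, members)
```

In this toy input, document `d4` lands in no candidate cluster, because its only frequent term sits on a branch that holds a single document. Handling such leftovers is presumably what the merging and moving-up operations are for, but since the abstract does not specify them, the sketch leaves them out.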

Keywords/Search Tags:

dataspace, data mining, text clustering, frequent itemsets, text document

Related items

1	Text Classification Using Sentential Frequent Itemsets
2	Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application In Simulation System
3	Text Clustering Method Based On Frequent Itemsets
4	Issues In TCM Text Mining
5	Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application
6	The Research And Implementation Of Mining Frequent Itemsets Algorithm Over Streaming Data
7	Research On Algorithm For Mining Frequent Itemsets Of Uncertain Data
8	FP-Tree Based Mining Frequent Itemsets Over Data Streams
9	The Research And Implementation Of Massive Short Message Mining Technology
10	Research On Algorithms For Mining Maximal Frequent Itemsets