The Research And Implementation Of Massive Short Message Mining Technology

Posted on: 2007-05-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Wang
GTID: 1118360215970558
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet and communication technology, huge volumes of data have accumulated. Short documents, such as paper abstracts and chat-room conversations, are common in this data, and analyzing and mining them can yield valuable implicit knowledge. However, unlike in ordinary documents, keywords in short documents appear with low frequency, so traditional mining techniques based on word frequency cannot achieve acceptable accuracy on short documents. Moreover, when processing text data of hundreds of gigabytes or even more than 1 TB, most existing mining algorithms become inefficient or even unusable.

Based on an analysis of the current status and challenges of short document mining, this thesis aims to develop short document mining algorithms with high accuracy and scalability. It studies short document mining techniques such as frequent term set mining, classification, and clustering. Semantic information in the short documents is used to improve the accuracy of the mining algorithms, and parallel mining methods are used to improve performance and scalability.

The main contributions of the thesis are as follows:

1. To address the challenges of frequent term set mining in very large short text databases, we present a parallel top-k frequent term set mining algorithm named parTFT. A novel logical vertical data partitioning method ensures that the top-k frequent term sets can be mined in parallel at each mining node. In addition, heuristic methods are used to prune the header table of the H-struct at each mining node, which improves the performance of the algorithm. Experimental studies show that parTFT has better performance and scalability than similar algorithms when mining very large short text databases.
The paper on the parTFT algorithm was published in the proceedings of the Sixth International Conference on Web-Age Information Management (WAIM 2005); its SCI index number is BDG49.

2. To improve accuracy when classifying short documents, we present a semantic-based short document classification algorithm named SDCS. SDCS uses a novel semantic features graph to represent semantic information and uses the KNN method to classify short documents. Experimental studies show that SDCS has better accuracy and performance than similar algorithms when classifying massive short documents. The paper on SDCS has been submitted to the Journal of Computer Research and Development.

3. Based on an analysis of the challenges of clustering massive short documents, we present two algorithms named FTSDC and DSDC. FTSDC is a clustering algorithm based on frequent term sets. It first partitions the documents into clusters according to the frequent term sets and then optimizes the clustering using semantic information. DSDC is a density-based clustering algorithm. It uses semantic information to calculate the distance between documents and clusters the documents based on an SNN graph. Data sampling and SNN graph partitioning are also used to cluster documents in parallel. Experimental studies show that both algorithms have better accuracy and performance than similar algorithms when clustering massive short documents. The paper on the FTSDC algorithm was published in the proceedings of the WISE Workshop on Web-Based Massive Data Processing (WMDP2006). The paper on the DSDC algorithm has been submitted to the Journal of Software.

4. To further improve the accuracy of the mining methods and to manage semantic information in a reasonable way, we define a domain ontology for short documents and present a method for building it.
Based on the domain ontology, we present a short document clustering algorithm based on frequent concept sets, named OFSDC, and a density-based clustering algorithm named ODSDC. Experimental studies show that the ontology-based methods make better use of semantic information and achieve better accuracy. The paper on OFSDC was published in the proceedings of the VLDB Workshop on Ontologies-based techniques for DataBases and Information Systems 2006 (ODBIS'06).

5. Based on a study of parallel data mining architectures, we present a CORBA-based parallel mining architecture for massive short documents and implement the massive short document miner in StarTPMonitor, a middleware for very large-scale transaction processing.
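
To make the notion of "top-k frequent term sets" in the first contribution concrete, the following is a minimal serial sketch in Python. It is not parTFT: it uses no H-struct, no vertical partitioning, and no parallelism, and simply enumerates candidate term sets by brute force. The function name `top_k_frequent_term_sets`, the `max_size` parameter, and the sample messages are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def top_k_frequent_term_sets(docs, k, max_size=2):
    """Return the k term sets (of up to max_size terms) with the highest support.

    Support is the number of documents that contain every term in the set.
    Brute-force illustration only; it does not scale to large databases.
    """
    counts = Counter()
    for doc in docs:
        # Each document contributes at most once to the support of a term set.
        terms = sorted(set(doc.lower().split()))
        for size in range(1, max_size + 1):
            counts.update(combinations(terms, size))
    # Rank by descending support; break ties lexicographically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical short messages, invented for this example.
docs = [
    "meeting moved to friday",
    "friday meeting cancelled",
    "lunch on friday",
]
result = top_k_frequent_term_sets(docs, k=3)
for term_set, support in result:
    print(term_set, support)
```

Asking for the top k sets, rather than all sets above a fixed support threshold, avoids having to guess a threshold in advance, which is hard when keywords in short documents are rare.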
Keywords/Search Tags: Massive Data, Short Message, Text Mining, Text Classification, Text Clustering, Frequent Term Set, Semantic, Ontology, Parallel Data Mining