
Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework

Posted on: 2019-01-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Muhammad Qasim Memon
GTID: 1368330593950005
Subject: Software Engineering
Abstract/Summary:
This thesis investigates Latent Dirichlet Allocation (LDA) based segmentation within an improvised sub-document based framework for efficient document clustering, in comparison with traditional clustering approaches such as segment based clustering. Document clustering, which is used for topic discovery and similarity computation, has been a major concern in text data management. The methods adopted in traditional clustering, particularly for multi-topic documents, are not viable because the content is distinguished by a sub-topical structure that is not pertinent across documents. Existing traditional approaches treat a text document as a single text unit for representation and similarity calculation, which is not justified for multi-topic documents.

The proposed improvised framework is a two-way approach to this concern. First, instead of applying clustering algorithms to the whole data set, documents are partitioned into cohesive sub-documents along topic boundaries using the LDA segmentation method, which imparts a two-level representation of the text data (topics and words). Second, the proposed clustering technique is compared with existing clustering methods (traditional and segment based) on multi-topic documents using clustering algorithms such as Spherical k-Means (Sk-Means), Overlapping Sk-Means (OSk-Means) and LDA in chapter 3. The sub-documents are further clustered into groups, each group forming a sub-document set that contains coherent groups of sub-documents in a large document. In addition, the sub-document sets and the original documents are clustered with partitional and hierarchical clustering, respectively, as discussed in chapter 4. Document segmentation is measured using the evaluation metric Pk (chapter 6), which measures the rate of error and thus signifies segmentation accuracy. Clustering quality is measured using the F-measure in terms of precision and recall, as described in chapter 4. Based on the evaluation selection model presented
in chapter 4, the clustering algorithms produce overlapping and non-overlapping clustering solutions. Moreover, experimental results of query processing for cluster matching are presented for time-efficient data retrieval in chapter 5. Query optimization is a very complex task for commercial databases involved in cluster formation and matching, and query processing is a major factor in finding a better execution plan. I investigated the problem of SQL query optimization purely from the perspective of query response time in different databases, using different query types such as join and complex queries. The query processing method follows the underlying topics in order to tune select, complex and join SQL queries with an optimized execution plan, using PL/SQL features and database objects such as procedures, triggers and methods to improve query performance for cluster formation.

In multi-topic document clustering, traditional methods are not viable because the content is distinguished by a sub-topical structure that is not pertinent across documents, and existing approaches treat a text document as a single text unit for representation and similarity computation, which is not suitable for multi-topic documents. Thematic parts of documents are identified through boundaries known as segments; when these are generated with TextTiling, words are repeated throughout the process and the segments are not related to or labelled with any topic information. Clustering approaches for multi-topic documents treat each document as a single text unit that is assigned to multiple clusters, without explicitly relating it to different topics. Existing methods fall short on multi-topic documents, producing below-par results with little or no connection to topic similarity for determining the perspective domain. The aim is therefore to develop a document clustering
approach in which each document is explicitly related to different topics. Traditional clustering of multi-topic documents involves different approaches that produce overlapping clustering solutions, such as fuzzy clustering, clustering based on generative models and ensemble subspace clustering. These approaches treat each cluster as a single topic or piece of information, and each document is assigned to multiple clusters according to topic relevance.

The TextTiling algorithm decomposes a text into contiguous blocks (passages and subtopics) to segment a document at topic boundaries, which is less efficient and robust than LDA. Block boundaries in a document correspond to topics that include terms and words. Patterns of lexical co-occurrence and distribution among contiguous blocks are measured using the dot product in the vector space. In the LDA based method, there is no need to examine all pairs of adjacent blocks to identify the segments of a document. Experimental results obtained with the proposed framework on two multi-topic data sets and different algorithms were compared with those of existing methods, such as multi-document segment based clustering and the multi-document segment based clustering approach using the TextTiling algorithm.

Document clustering is a useful technique that organizes a large collection of text into cohesive groups. Each group is associated with a cluster and labelled with relevant words and terms describing the associated documents. Conventional clustering approaches cannot accurately represent the associated documents through the semantic relationships among words. Ontology-based document clustering
could exploit the semantic relations between words to improve clustering quality, for example through an ontology-based general weighting schema framework or e-Learning domain-specific ontology-based document clustering. However, several issues remain, such as retrieving word semantics from texts, synonymy and polysemy, appropriate declaration of clusters and high dimensionality. To address these issues, the integration of WordNet and lexical chains has been attempted to generate clusters with accurate assessment of terms for word sense disambiguation. However, a reference ontology in ontology-based document clustering cannot represent and include all terms, and it is very challenging to associate terms with clusters when they are not present in the reference ontology. The above-mentioned clustering methods were mostly biased towards clustering each document as a whole single text unit and were found less effective in providing efficient and accurate clusters. In contrast, topic modeling and document segmentation methods can be combined so that document segmentation and document clustering converge in our proposed sub-document based framework.

The proposed clustering framework outperformed the existing approaches in terms of F-measure and time cost, with an average improvement in F-measure of 10.2% and 11.5% for Reuters Corpus Volume 1 (RCV1) and 20 Newsgroups, respectively, in experiment 1. In addition, the highest macro F-measure of 0.791, with an average improvement of 10.2%, was observed on the RCV1 dataset, compared with an average improvement of 11.2% on the 20 Newsgroups dataset, which contained much smaller sub-documents within a document. In terms of precision, the proposed clustering framework performed better than traditional document clustering methods, with an average improvement of over 54%. In experiment 2, various real-time data sets containing multi-topic documents are used for a comprehensive presentation and validation of clustering
algorithms through the proposed sub-document based framework. Moreover, the sub-document based framework improved performance by over 73% in terms of F-measure using both LDA segmentation and bisecting LDA when compared with TextTiling.

Experimental results of document segmentation are produced using LDA and TextTiling, and segmentation performance is presented based on the evaluation metrics (chapter 6). The sub-document based framework using the cross clustering model was compared with no cross (within document) clustering and showed improved time and memory cost using the LDA segmentation method and the LDA clustering algorithm. These evaluations also showed that LDA document segmentation outperformed TextTiling, obtaining improved results for clustering solutions (disjoint and overlapping) in cross and within document clustering across different clustering methods (section 6.5). Further, experimental results of different clustering methods were investigated in the proposed framework for each representation model (sub-document, sub-document set and document) in cross and within document (no cross) fashion. These results were also compared with traditional clustering and the segment based framework and revealed an improvement in terms of F-measure (section 6.6). Moreover, the performance evaluation of the sub-document framework on different datasets is presented in experiment 1 and experiment 2 in sections 6.7 and 6.8, respectively. The proposed framework improved clustering performance particularly for the Bisecting k-Means clustering algorithm. In addition, experimental results suggested that LDA segmentation outperformed TextTiling in terms of time cost, accuracy and memory cost. We tested the statistical significance of the results achieved by the proposed sub-document based framework using LDA segmentation compared with TextTiling. The significance test assumes unequal variances owing to the multiple representations of
documents (sub-document, sub-document set and document), which yielded higher values. Further, an unpaired t-test is computed under the null hypothesis of no difference between the achieved results.

The significant feature of the proposed framework is its emphasis on topic modeling to improve segmentation using LDA combined with clustering algorithms, where sub-documents are identified and extracted by computing the rate of error in segmentation (in terms of Pk) based on topics and words. LDA based segmentation is demonstrated via the sub-document based clustering algorithm on domain data by training the topic models, which yielded better performance than the standard TextTiling segmentation. Cluster matching through queries is performed in terms of query processing and optimization by the proposed Add_Atribute procedure, which is embedded into queries in order to eliminate poor SQL statements. Further, the PL/SQL code is debugged to produce a better execution plan that optimizes the queries for time and memory cost, using manual tuning rather than relying only on automatic SQL tuning.

The proposed sub-document based framework was found to be accurate and efficient in terms of F-measure, time and memory cost. It outperformed the segment based framework and traditional clustering on multi-topic documents, producing above-par results connected to topic similarity for determining the perspective domain, and is hence a significant and efficient approach for document clustering.
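The abstract reports segmentation error in terms of Pk without stating how it is computed. The following is a minimal sketch of the commonly used sliding-window formulation of Pk, under the assumption that the thesis follows it. The function name `pk`, the per-sentence segment-label encoding and the default window size (half the average reference segment length) are illustrative choices, not taken from the thesis.

```python
def pk(reference, hypothesis, k=None):
    """Sliding-window segmentation error rate (lower is better).

    reference, hypothesis: equal-length sequences giving the segment
    label of each text unit (e.g. each sentence).
    k: window size; if omitted, half the average reference segment
    length is used (an assumption, following common practice).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same text units")
    n = len(reference)
    if k is None:
        # Number of reference segments = number of label changes + 1.
        num_segments = 1 + sum(
            1 for a, b in zip(reference, reference[1:]) if a != b
        )
        k = max(1, round(n / (2 * num_segments)))
    if k >= n:
        raise ValueError("window size must be smaller than the text length")

    errors = 0
    for i in range(n - k):
        # Do the two ends of the window fall in the same segment?
        same_in_reference = reference[i] == reference[i + k]
        same_in_hypothesis = hypothesis[i] == hypothesis[i + k]
        if same_in_reference != same_in_hypothesis:
            errors += 1
    return errors / (n - k)


# Six sentences, true topic boundary after the third sentence.
reference = [0, 0, 0, 1, 1, 1]
no_boundary = [0, 0, 0, 0, 0, 0]
print(pk(reference, reference, k=2))    # perfect segmentation
print(pk(reference, no_boundary, k=2))  # boundary missed
```

A value of 0 means the hypothesised boundaries agree with the reference in every window; larger values indicate more disagreement, which is why a lower Pk signifies higher segmentation accuracy.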
Keywords/Search Tags:Document Clustering, LDA, Clustering Algorithm, Topic Modeling, Information Retrieval, Query processing, Document Segmentation