
Methods For Phrase-based Text Mining And Analysis

Posted on: 2019-12-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Li
Full Text: PDF
GTID: 1488306344958969
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, information is being generated at an unprecedented scale. The tremendous volume and high velocity of data cause an "information overload" problem: the data exceed the capability of both people and traditional data management and analysis methods to understand, process, and effectively utilize information. A large portion of this information is unstructured text data, including social media texts, web pages, news articles, and academic papers. Therefore, text mining and analysis techniques, as a powerful methodology for automatically discovering knowledge from massive corpora, have attracted extensive research attention from both academia and industry, and have become a core research topic in the Natural Language Processing and Data Mining communities in the big data era.

A phrase is a natural, less ambiguous, and meaningful semantic unit. Investigating phrase-level text mining and analysis techniques is therefore of high value for helping people explore and understand unstructured text data more effectively and efficiently. Unfortunately, existing methods often suffer from low phrase quality, low topical cohesion, low adaptability to the different compositionality of phrases, and poor scalability beyond small corpora. The challenge is how to address these shortcomings to improve the capability and efficiency of text mining and analysis. To address the above issues, we intensively study phrase mining, topical phrase mining, and phrase embedding methods, which can support text mining and text-based analysis well. Specifically, the main contributions of this dissertation are as follows.

First, we propose an efficient quality phrase mining method that makes the first effort to consider the order-sensitivity problem in mining phrases. The proposed method eliminates order sensitivity to improve the quality of mined phrases. Considering the high computational cost of mining complete phrases, we propose a dynamic programming based method, a chunk-based method, and a seed extension based method to reduce the computational complexity, so that scalability over massive corpora is greatly improved. Moreover, to improve the efficiency of frequency counting and retrieval, we propose a novel PhraseTrie structure, which shares common prefixes to achieve better efficiency than conventional data structures. The experiments demonstrate that our method is 3 to 18.7 times faster than the best state-of-the-art method.

Second, considering the problem of false segmentation of overlapping phrases in mining topical phrases, we propose an overlapping phrase segmentation algorithm that takes both the intra-cooccurrence of phrases and the isolation of the split position into consideration. In addition, a parameter estimation method and a dynamic programming based segmentation strategy are proposed to reduce computational complexity. To address the false topic assignment of constituent words, we propose a novel topic model, CPhrLDA, based on the "bag-of-phrase" assumption, in which the topic of a constituent word can be assigned more flexibly. Further, we propose a density peak based k-means clustering method, along with an iterative scheme, to facilitate finding domain-specific phrases. The experiments demonstrate that our method performs 12% better in topical cohesion than the best state-of-the-art method.

Third, we propose a hierarchical compositional model to support phrase embedding with various kinds of compositionality, including the hybrid compositionality that widely exists but is ignored by existing methods. In this model, a phrase's composition can be expressed implicitly by aggregating the compositionality along the paths in the hierarchical structure, so the model complexity is greatly reduced. We also propose an EM-based framework to infer the internal structure and learn the model parameters. For the former, a dynamic programming based approach is proposed to improve efficiency. For the latter, the learned compositionality is used to update the hierarchical structure and the embeddings accordingly. The experiments demonstrate that, compared with the best state-of-the-art method, our method is on average 5.5% better on analogical reasoning tasks and 1.8% better on the phrase similarity task.

In summary, we propose a set of data-driven, highly scalable, phrase-based text mining and semantic learning approaches, including an efficient high-quality phrase mining method, a highly cohesive topical phrase mining method, and an adaptive hierarchical compositional model for phrase embedding. Experiments are conducted on various datasets and on many text mining and analysis tasks; the results demonstrate that the proposed methods outperform state-of-the-art methods in both effectiveness and efficiency.
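The abstract states only that PhraseTrie shares common prefixes to speed up frequency counting and retrieval; it does not describe the structure itself. As a rough illustration of that general idea, and not the dissertation's actual PhraseTrie, the sketch below uses an ordinary word-level trie in which phrases with the same leading words share nodes and each node stores the frequency of the phrase that ends there. The names `TrieNode`, `PhraseCountTrie`, `add_ngrams`, and `frequency` are hypothetical and chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One word position; `count` is the frequency of the phrase ending here."""
    count: int = 0
    children: dict[str, "TrieNode"] = field(default_factory=dict)


class PhraseCountTrie:
    """Generic prefix-sharing trie for n-gram counts.

    Hypothetical illustration only: this is NOT the PhraseTrie from the
    dissertation, whose design is not given in the abstract.
    """

    def __init__(self) -> None:
        self.root = TrieNode()

    def add_ngrams(self, tokens: list[str], max_len: int) -> None:
        """Count every contiguous n-gram of at most `max_len` words."""
        for start in range(len(tokens)):
            node = self.root
            # Walk down from the root; n-grams starting at `start` reuse
            # the nodes of their shared prefix instead of storing it again.
            for word in tokens[start:start + max_len]:
                child = node.children.get(word)
                if child is None:
                    child = TrieNode()
                    node.children[word] = child
                child.count += 1
                node = child

    def frequency(self, phrase: list[str]) -> int:
        """Return how often `phrase` was seen, or 0 if it never occurred."""
        node = self.root
        for word in phrase:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count


trie = PhraseCountTrie()
trie.add_ngrams("support vector machine".split(), max_len=3)
trie.add_ngrams("support vector regression".split(), max_len=3)
print(trie.frequency(["support", "vector"]))             # 2
print(trie.frequency(["support", "vector", "machine"]))  # 1
```

In this sketch a lookup costs one dictionary access per word in the phrase, and the two sentences store "support vector" only once. Whether the dissertation's structure works the same way is an assumption.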
Keywords/Search Tags:text mining, natural language processing, phrase mining, topical phrase mining, topic model, phrase embedding