
Research On The Application Of Topic Analysis Method Based On LDA Model In Korean Big Data Text Mining

Posted on: 2021-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Li
GTID: 2518306026471074
Subject: Basic mathematics
Abstract/Summary:
With the rapid spread and development of technologies such as the Internet and multimedia, the information age has arrived. The volume of information and data we encounter keeps growing, and acquiring data and knowledge has become easier. As a result, the demands on information and data processing capability have also risen. Text is the most basic and most accessible of all data types, and also one of the most important, so a relatively large share of analysis and processing work targets it. How to process large-scale text data efficiently and accurately is therefore a problem that deserves attention. Big data is about more than quantity and size: what matters most is that the original data contain a large amount of core data that is valid, meaningful, and worth studying. The key is turning a large volume of raw data into valuable data that expresses its core content. This requires applying data mining methods to text data, that is, text mining.

To analyze large-scale text data efficiently, a distributed big data platform was set up and some of its functions were optimized, and text mining techniques including clustering algorithms, TF-IDF, and topic models are employed. Visual analysis is carried out in R, and an analysis method based on a topic model is used to improve the efficiency of text topic mining and obtain better analysis results. This research mainly uses Korean texts as data, which is also one of the core contents of the study. Drawing on the author's own language proficiency, hard-to-obtain Korean text data were collected, sorted, and summarized, and some Chinese data were added for bilingual text mining analysis. Owing to various policies on the Korean side, the related information and content remain poorly understood. This thesis collects Korean text data and analyzes and processes it with relevant statistical methods.

The thesis addresses three major difficulties. The first is improving the quality of data preprocessing and building a data platform suitable for big data analysis. The second is the accurate analysis and processing of Korean texts. The third is optimizing the efficiency of text data analysis relative to traditional algorithms. To solve these problems, a text searcher with an exact-matching function was first designed specifically for text preprocessing, which improved preprocessing efficiency, and a distributed big data analysis platform suitable for large-scale data analysis was built. Next, a Korean-language analysis system was adopted, together with various R packages for Korean, to achieve accurate analysis of Korean texts. Finally, the LDA model is used to overcome the disadvantages of traditional clustering methods, accurately analyze the topics of the text data, and improve analysis efficiency. In practical application, the results show that cluster mining analysis based on the LDA topic model is significantly better than traditional text mining analysis.
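As a rough illustration of the LDA topic analysis the abstract refers to, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in standard-library Python. It is not the thesis's implementation (the abstract states that the analysis was done with R packages). The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus of pre-tokenized Korean nouns are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (vocab, theta, phi): theta[d][k] is the estimated share of
    topic k in document d, phi[k][v] the probability of word v in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    num_words = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # n_dk
    topic_word = [[0] * num_words for _ in range(num_topics)]  # n_kv
    topic_total = [0] * num_topics                             # n_k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tv + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + num_words * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + num_words * beta)
         for v in range(num_words)]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


# Hypothetical toy corpus: already-tokenized Korean nouns
# (economy/market/trade vs. soccer/player/match).
docs = [
    ["경제", "시장", "무역", "경제"],
    ["시장", "무역", "경제"],
    ["축구", "선수", "경기", "축구"],
    ["선수", "경기", "축구"],
]
vocab, theta, phi = lda_gibbs(docs, num_topics=2)

for d, row in enumerate(theta):
    print(d, [round(p, 3) for p in row])
```

Unlike K-Means, which assigns each document to exactly one cluster, this model gives every document a distribution over topics (`theta`) and every topic a distribution over words (`phi`). A real Korean pipeline would also need morphological analysis before this step, which the sketch assumes has already been done.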
Keywords/Search Tags:Korean text mining, Distributed big data platform, K-Means algorithm, LDA topic analysis