
Comparative Analysis Of Chinese And Foreign Journal Based On The Optimized LDA Topic Model

Posted on: 2021-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Jiang
GTID: 2518306314453714
Subject: Applied Statistics
Abstract/Summary:
With the advent of the big data era, information is growing explosively and large amounts of data are presented to people in many forms. Data is no longer limited to numbers: text, pictures, videos, and even a person's small actions can be retained as data. These data contain a great deal of usable information. How to extract that information from large volumes of seemingly useless and meaningless data, and how to maximize its value, has become a problem people are eager to solve. In the field of text, the most important material for researchers is the literature. Most uses of the literature are confined to studying a single document or to descriptive analysis, so the information is not fully explored and the connections between documents are ignored. Chinese and English literature are equally valuable to researchers, but their similarities and differences are not well understood. Knowing where research in the same field is focused in different countries provides important guidance and reference for scholars in that field. It is therefore of great theoretical value and practical significance to use text mining to study the internal connections and differences between large bodies of Chinese and foreign literature.

This paper takes the Chinese Medical Journal and the Journal of the American Medical Association, authoritative medical journals in China and the United States respectively, as examples for descriptive analysis and improved topic modeling. To address the shortcomings of the traditional topic model, this paper proposes a topic model better suited to the literature, with the aim of comparing the two journals accurately, uncovering the information hidden in them, showing research hotspots and development trends in the medical field in China and the United States, and providing researchers with reference information.

The title, authors, date, and abstract of articles published in the two journals from 2010 to 2019 were collected with a web crawler. To characterize the publishing patterns of Chinese and foreign journal authors, the annual output, the number of high-yield authors, and the co-authorship rate of the papers were analyzed. The abstracts were then preprocessed. For the Chinese literature this included word segmentation, stop-word removal, and removal of single-character words; for the English literature it included tokenization, conversion of uppercase to lowercase, stop-word removal, and removal of words with fewer than three letters. The traditional LDA model cannot model highly specialized medical terms with high precision. To address this, the LDA model is modified slightly and used in combination with TF-IDF. The screening ability of TF-IDF is used to select high-frequency specialized medical terms in the literature, and LDA topic modeling is then carried out on them. This double screening helps ensure that the words in each topic are accurate and representative.

From the descriptive analysis, this paper concludes that the annual research output of the Chinese Medical Journal is smaller than that of the Journal of the American Medical Association, and that the gap has widened year by year since 2014. The co-authorship rate of both journals has generally increased, which suggests that co-authorship is becoming a trend. Compared with the traditional LDA topic model, the LDA topic model combined with TF-IDF has better accuracy and applicability. Comparing the two journals with the improved LDA topic model shows that some research topics in clinical medicine are shared between China and abroad, while the research directions of the other topics differ. The paper suggests increasing academic discussion, exchange, and study in subjects such as emergency medicine, in which fewer papers are published in China.

The innovation of this paper lies in the data processing stage. Because of the specialized vocabulary of the medical field, general-purpose word segmentation tools cannot segment the text accurately, which strongly affects the subsequent modeling. To address this, this paper merges a clinical disease vocabulary and a medical vocabulary into a custom dictionary and adds it to the jieba word segmenter, so that segmentation is complete and accurate and medical terms are kept intact as far as possible. The stop-word list is a merged list built from the Harbin Institute of Technology stop-word list, the Sichuan artificial intelligence laboratory stop-word list, and the Baidu stop-word list, and it performs stop-word removal well. In the modeling stage, TF-IDF is first used to weight the preprocessed corpus, and words with relatively large weights are then selected for LDA analysis. The results can be regarded as more accurate, and topics can be labeled more intuitively.
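
The English preprocessing and the TF-IDF screening step that precedes LDA can be sketched as follows. This is a minimal illustration in plain Python, not the author's code. The stop-word list, the sample abstracts, the top_k cutoff, and all function names are assumptions. The weighting used here (relative term frequency times the log of inverse document frequency) is one common TF-IDF variant, and the abstract does not say which variant the thesis uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the merged list used in the thesis).
STOP_WORDS = {"the", "and", "with", "for", "was", "were", "are", "from", "that", "this"}


def preprocess(text):
    """Lowercase, split into alphabetic tokens, drop stop words and words under three letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def select_vocabulary(docs, top_k):
    """Keep the top_k terms, ranked by their highest weight in any document."""
    best = {}
    for doc_weights in tfidf_weights(docs):
        for term, weight in doc_weights.items():
            best[term] = max(best.get(term, 0.0), weight)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return set(ranked[:top_k])


def filter_corpus(docs, vocabulary):
    """Drop every token that did not pass the TF-IDF screen."""
    return [[t for t in doc if t in vocabulary] for doc in docs]


# Hypothetical sample abstracts, for illustration only.
abstracts = [
    "Patients with sepsis were treated in the emergency department.",
    "Sepsis mortality in patients: a cohort study of emergency admissions.",
    "Hypertension treatment and stroke risk in elderly patients.",
]
tokenized = [preprocess(a) for a in abstracts]
vocabulary = select_vocabulary(tokenized, top_k=8)
filtered = filter_corpus(tokenized, vocabulary)
```

In this toy corpus the word "patients" occurs in every abstract, so its inverse document frequency is log(1) = 0 and it is screened out before topic modeling. The filtered token lists would then be passed to an LDA implementation of the reader's choice. For the Chinese abstracts, a segmentation step with a custom dictionary would take the place of the regular-expression tokenizer.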
Keywords/Search Tags: Chinese Medical Journal, Journal of the American Medical Association, Text Mining, Journal Comparison, LDA Topic Model