Research On The Extraction And Evolution Of Hot Topics In Scientific And Technological Literature

Posted on:2022-01-29

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J L Wang

Full Text:PDF

GTID:1488306731961819

Subject:Management Science and Engineering

Abstract/Summary:

Science and technology (S&T) literature resources are the carrier and precious wealth of human knowledge: they summarize the experience, ideas, and research results that people have accumulated in S&T practice. S&T literature is a highly concentrated form of S&T knowledge that is scientific, innovative, comprehensive, and trustworthy. It records the progress of science and technology and the development of society, and it helps to promote our country's technological innovation, economic development, and social progress. With the rapid development of big data, artificial intelligence, cloud computing, and other technologies, the volume of S&T literature has exploded, which has brought problems such as a wide variety of types, complex content, low utilization, and wasted resources. In the face of massive S&T literature resources, how to make full use of them, quickly extract valuable and meaningful information, improve their utilization rate, and promote sharing and technological innovation are important issues that urgently need to be resolved.

Based on this, this thesis analyzes the existing technologies and methods for processing text. Aiming at the types, characteristics, and existing problems of scientific literature, we propose a text topic extraction method based on topic models, use topic coherence to evaluate the quality of the extracted topics, and further propose a variety of topic evolution models to analyze how the extracted topics evolve. In this way, we can comprehensively, scientifically, and quickly understand the research hotspots, development rules, research frontiers, research directions, research characteristics, and future evolution trends of the related research fields. Specifically, this thesis mainly includes the following:

(1) We elaborate on the definition, classification, characteristics, existing problems, utilization, sharing, and innovation of S&T literature, put forward the research questions, and then introduce the technologies and methods related to text processing. We then analyze the concepts of topic extraction models and topic evolution from multiple levels and angles. We further discuss the essential links of text mining, namely text data modeling theory and representation, text preprocessing, and text mining toolkits, which lay the foundation for large-scale text topic extraction, topic quality evaluation, and topic evolution.

(2) Topic extraction. Aiming at the problem that valuable information and knowledge cannot be obtained directly and quickly from S&T literature, this thesis proposes four models and methods for extracting and analyzing topics from the titles, keywords, and abstracts of literature in the blockchain field: Time-based Dynamic Latent Dirichlet Allocation (TDLDA), time series-based Dynamic Nonnegative Matrix Factorization (DNMF), selection criteria for important topics, and a calculation model for topic strength. First, we use the third-party Python library jieba to segment and count the text of the titles, keywords, and abstracts, obtaining documents that retain only the key words. Second, we use the TDLDA and DNMF topic models to extract topics from the titles, keywords, and abstracts for each year, obtain the topic words under different K values, and summarize the corresponding topics from the topic words. Then, according to the selection criteria for important topics, we calculate the importance of the topics in each year's documents. Next, we merge the topics that TDLDA and DNMF extract from the titles each year with the topics from the 2016-2021 topic importance table to obtain the TDLDA-title and DNMF-title topics, and merge the two as the topics extracted from blockchain literature titles. The topics for keywords and abstracts are obtained in the same way. Finally, we merge the topics extracted from the titles, keywords, and abstracts as the core hot topics in the field, calculate their topic strength with the topic strength model, and select the top-ranked topics for analysis and discussion.

(3) Topic quality evaluation. Aiming at how to measure topic quality and obtain high-quality topics, this thesis proposes two methods to evaluate topic quality: topic coherence and an expert-based topic relevance evaluation method. First, with PCA and SVD as baselines, we calculate the topic coherence values of the topics that TDLDA and DNMF extract from titles, keywords, and abstracts under different K values and different Top N settings. The topic coherence values indicate which topic model extracts higher-quality topics. Second, specific examples are used to verify the correctness, feasibility, and rationality of these conclusions. From the topics that TDLDA and DNMF extract from titles, keywords, and abstracts, 6 identical topics are selected as the data set and scored with the expert-based evaluation method. The resulting degree of relevance is used to judge which topic model has the better extraction effect.

(4) Topic evolution. Aiming at how to track the dynamic changes of topics and reveal their evolution, this thesis proposes four methods to analyze the evolution of keywords and topics, showing the evolution trends from different dimensions: an evolution method based on word cloud diagrams, a normalized model based on Lucene scores, an evolution model of topic strength, and an evolution model of topic correlation strength. First, we use the preprocessed keywords of the titles, keywords, and abstracts of the documents from 2016 to 2021 as the data set and visualize each of them with word clouds. This method gives a rough, overall view of how a keyword evolves. Second, we use the 40 keywords obtained from the literature titles as the data set, use the normalized Lucene score model to calculate their ratios over the 6 years, and observe the evolution trends. This method shows specifically when a keyword appears and disappears, so the evolution trend of each keyword can be seen more clearly and intuitively. Then, we use topics of good quality, such as DNMF-title, DNMF-keywords, and TDLDA-summary, as the data set and use the topic strength model to calculate topic strengths, identifying topics that are rising, steady, or declining. This method analyzes the evolution trend of each topic over the 6 years from the perspective of word frequency statistics. Finally, we use the topics extracted by TDLDA-summary as the data set and use the topic relevance strength to calculate the relevance of several topics that appear together each year. This method analyzes evolution from the perspective of semantic relevance and shows the focus and sharpness of a given topic.
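
The abstract does not state the formula behind the topic strength model, so the following Python sketch only illustrates the word-frequency idea and is not the thesis's actual method. It assumes that a topic's strength in a year is the share of that year's tokens belonging to the topic's words, and that a trend can be labeled from the sign of a least-squares slope. The function names, the `tolerance` threshold, and the toy token lists (standing in for jieba-segmented documents) are all hypothetical.

```python
from collections import Counter


def yearly_strength(docs_by_year, topic_words):
    """Assumed definition: share of a year's tokens that are topic words."""
    topic_words = set(topic_words)
    strength = {}
    for year, docs in sorted(docs_by_year.items()):
        counts = Counter(token for doc in docs for token in doc)
        total = sum(counts.values())
        hits = sum(counts[word] for word in topic_words)
        strength[year] = hits / total if total else 0.0
    return strength


def trend(strength, tolerance=0.01):
    """Label a yearly series from its least-squares slope (assumed rule)."""
    years = sorted(strength)
    n = len(years)
    if n < 2:
        return "steady"
    mean_x = sum(years) / n
    mean_y = sum(strength[y] for y in years) / n
    cov = sum((y - mean_x) * (strength[y] - mean_y) for y in years)
    var = sum((y - mean_x) ** 2 for y in years)
    slope = cov / var
    if slope > tolerance:
        return "rising"
    if slope < -tolerance:
        return "declining"
    return "steady"


# Toy, made-up token lists; real input would be segmented titles,
# keywords, or abstracts grouped by publication year.
docs_by_year = {
    2016: [["bitcoin", "mining"], ["bitcoin", "wallet"]],
    2017: [["bitcoin", "contract"], ["smart", "contract"]],
    2018: [["smart", "contract"], ["contract", "privacy"]],
}

smart_contract = yearly_strength(docs_by_year, ["smart", "contract"])
bitcoin = yearly_strength(docs_by_year, ["bitcoin"])
print(smart_contract, trend(smart_contract))
print(bitcoin, trend(bitcoin))
```

On this toy data the hypothetical "smart contract" topic is labeled rising and "bitcoin" declining. A share-based strength is used here so that years with more documents do not dominate the comparison; other normalizations would be equally plausible.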

Keywords/Search Tags:

scientific literature, topic extraction, topic model, topic quality evaluation, topic evolution

Related items

1	The Research On Topic Evolution For Chinese Literature Of Science And Technology Based On LDA
2	Research On Method Of Constructing Temporal Topic Chains Based On The Scientific Literature
3	Topic Discovery And Trend Analysis In Scientific Literature Based On Topic Model
4	Topic Analysis And Recommendation System Based On Scientific Research Documents
5	Research On Deep Processing And Topic Evolution Of English Scientific And Technical Literature For Selective Dissemination Of Information
6	Research On Topic Evolution In Social Networks
7	Research On Probabilistic Topic Model And Its Application In Multimedia Topic Discovery And Evolution
8	The Design And Implementation Of Topic Evolution Tracking For Micro-Blog
9	Research On Evolution Model Of Microblog Topic Based On Time Sequence
10	The Research On Topic Access And Evolution With LDA