Research On Word Migration In The Process Of Scientific Research Topic Evolution

Posted on: 2018-04-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B T Chen
GTID: 1368330515489620
Subject: Information Science
Abstract/Summary:
The evolution of scientific research topics and the analysis of their content have long been a concern of information science. The large and growing body of academic literature both challenges the analysis of research topics and provides ample resources for academic text mining. Topic evolution is a dynamic process. As a research field develops, new themes emerge, and existing themes become active, mature, or die out. The content of each topic also changes over time: a single topic may split into several topics, and several topics may merge into a new one. Understanding the evolution of research topics and analyzing their content in depth can help new researchers gain an overview of a field, help experts communicate within and across domains, inform research funding managers and policy makers about the development of scientific innovation, and support decision making based on the flow of domain knowledge.

Given the importance of this subject, disciplines such as data mining have paid it considerable attention. Information science, by contrast, has studied the evolution of research topics less, and in particular lacks studies of evolutionary dynamics and of changes in topic structure during evolution. Meanwhile, in computer science and related fields, whose orientation is technological, research on topic evolution concentrates on constructing and optimizing evolutionary models. That work neglects the state of knowledge exchange and the changing development status of topics in different periods, and it does not extend content analysis to the word level. In short, information science needs more advanced technical methods to analyze changes in topic structure and to identify and extract the distribution of words among topics, while data mining, being technically oriented, needs to be complemented by in-depth content analysis of research topics. This thesis therefore combines topic modeling and text mining methods from data mining and machine learning with the content-analysis strengths of information science. Using the scientific literature of the information retrieval field as a case-study data set, it examines the distribution of words and the semantic shifts of words across domain topics during topic evolution.

The thesis contains seven chapters. Chapter one introduces the background and significance of the research, the state of related research worldwide, and the content, methodology and innovations of this study. Chapter two discusses the theoretical basis, including the transformation of scientific paradigms, Bayesian networks and the construction of topic models, and the general definition of word semantics, which together underpin the subsequent topic extraction, evolution analysis and study of word distributions within topics.

Chapter three concerns the discovery of scientific topics in large unstructured collections of scientific documents. Topics are extracted and analyzed with the LDA topic model from a text data set built from the research literature. The data source is research papers in the field of information retrieval, retrieved from the Web of Science database for the period 1956-2014, 20,359 papers in total. Five major topics are extracted for the subsequent evolution and word analyses.

Chapter four analyzes the evolution of the research topics, identifying and examining the growth trends and evolutionary dynamics of the five major topics in information retrieval. For the growth trends, the per-document topic probability distributions produced by training the LDA topic model are aggregated year by year, giving each topic's share of the total content of the papers published in a given year. Whereas current measures of topic activity still rely on simple counts of published literature, this approach preserves the fact that a document contains multiple topics in different proportions. The analysis of evolutionary dynamics addresses three shortcomings of current topic evolution research: its treatment of topic splitting and merging, of knowledge exchange, and of the development status of topics in different periods. The whole corpus is divided into six time windows, and from each window local topics, which exist only in the corresponding period, are extracted. The five major topics extracted in chapter three are called global topics. Knowledge exchange within and between topics is represented by the splitting and merging of local topics. Computing the similarity between the probability distributions of topic terms yields the correlation between a global topic and a local topic, as well as the splitting and merging relations between local topics in adjacent time windows. The relevance of local topics to a global topic in different periods reflects the development status of that global topic in those periods.

Building on the previous chapters, chapter five examines the evolution of research topics at the word level, focusing on the phenomenon of word migration during topic evolution. Research topics are represented as collections of words with semantic functions, and the evolution of research topics is essentially a change in the innovation and application associated with those words. Word-level analysis is therefore key to a deeper understanding of how scientific research evolves. The chapter first explains how widespread word migration is and defines it as the appearance of the same word in different topics. By analogy with real-world migration, such as the geographical migration of human populations, words correspond to people and topics to territories. The types, stability and semantic shifts of word migration are then measured and analyzed.

Chapter six verifies and analyzes the general laws of word migration. Three hypotheses are proposed. The first is the similarity hypothesis: words with similar contexts have similar migration directions. The second is the diversity hypothesis: polysemous words tend to have a higher probability of migration. The third is the coherence hypothesis: words that are important to a topic have a lower degree of migration. The chapter first quantifies the degree of word migration using the information-theoretic definition of entropy, to support the subsequent verification. The law of similarity relates the semantic similarity of words to the direction of their migration. The Word2vec word embedding model is used to represent words as vectors, and word similarity is measured by the cosine similarity between word vectors. The law of diversity relates the semantic diversity of a word to its degree of migration. Polysemy is measured by the local clustering coefficient of the word's vector in the K-nearest-neighbor graph of word vectors. The law of coherence relates the importance of a word to a topic to its degree of migration. Importance is measured by the word's tf-idf score in its assigned topic.

Chapter seven summarizes the study, covering its conclusions, its limitations and prospects for future work. Through theoretical and empirical analysis, the thesis reaches three conclusions.

(1) The five major topics in information retrieval generally evolve from an adjustment status to a mature status. A mature topic may re-enter an adjustment status after new knowledge is introduced and its content is reorganized, and then reach a new period of maturity. Knowledge exchange, reflected in topic splitting and merging, happens both within a topic and between topics. Topics that developed earlier in the field output knowledge to later topics; in turn, later topics feed their innovative technologies and methods back, forming knowledge flows. Some topics, because their research themes are distinctive and coherent, exchange little knowledge with other topics and thus follow a relatively closed development path.

(2) The evolution of scientific research is essentially a change in words and word semantics. Understanding how the core words of research topics change across periods is key to analyzing topic evolution in depth. This thesis defines word migration as the appearance of the same word in different topics. Word migration concerns changes in the semantic meaning of words; in the evolution of scientific research, it is in effect a change in the innovation and application associated with the words. Word migration can be summarized into three types: non-migration, dual-migration and multi-migration. When multiple words in a topic tend to migrate to other topics, the research problems related to that topic are attracting less attention in the field, and the topic as a whole is contracting and declining. Regarding the stability of word migration, the focus is on convergent words and divergent words. The divergence of words reflects the development of their semantics from subjectivity toward topic-specific meaning. By contrast, the convergence of words usually reflects that the research and applications associated with them are of interest to multiple topics and that the related innovation has become a hot topic in the domain.

(3) By examining how a word's context similarity, semantic diversity and importance in a topic relate to the direction and degree of its migration, the thesis proposes three general laws of word migration. The first is the law of similarity: words with similar contexts have similar migration directions. The second is the law of diversity: polysemous words have a high degree of migration. The third is the law of coherence: words that are important to a topic have a lower degree of migration. The study shows that the migration of high-probability words in the topics of the information retrieval field verifies the three laws. Under the law of similarity, words with similar contexts are mainly of two types, synonyms and frequently co-occurring phrases, and they usually migrate in similar directions. However, when multiple words often co-occur, their semantics influence one another, which produces inconsistencies in their migration. The law of diversity and the law of coherence are connected. Under the law of diversity alone, the fewer meanings a word has, the more likely it is to stay in one topic. When the law of coherence is also considered, however, a less polysemous word may be important to multiple topics, or may usually appear in the same context yet be used by multiple topics, so that it still appears in several topics and migrates.
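
The growth-trend analysis in chapter four aggregates per-document topic probability distributions year by year. The following is a minimal sketch of that aggregation, not the thesis's own code: the function name `yearly_topic_proportions` and the plain-list input format are assumptions made for illustration, and each document is assumed to carry a topic distribution that sums to 1.

```python
def yearly_topic_proportions(doc_topics, doc_years):
    """Aggregate per-document topic distributions into per-year topic shares.

    doc_topics: list of per-document topic probability lists (hypothetical
                input format; each list is assumed to sum to 1).
    doc_years:  list of publication years, aligned with doc_topics.
    Returns a dict mapping year -> list of topic shares that sum to 1.
    """
    sums = {}
    for theta, year in zip(doc_topics, doc_years):
        # Accumulate the topic probabilities of every paper in that year.
        acc = sums.setdefault(year, [0.0] * len(theta))
        for k, p in enumerate(theta):
            acc[k] += p

    shares = {}
    for year, acc in sums.items():
        total = sum(acc)
        # Normalize so each topic is expressed as a share of the year's content.
        shares[year] = [v / total for v in acc]
    return shares


if __name__ == "__main__":
    topics = [[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]]
    years = [2000, 2000, 2001]
    print(yearly_topic_proportions(topics, years))
```

Unlike a count of papers per topic, this keeps the contribution of a paper that mixes several topics in different proportions.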
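
Chapter six quantifies the degree of word migration with information entropy. The abstract does not give the exact formula, so the sketch below is one plausible reading under stated assumptions: a word's weights across topics are normalized into a distribution, and the Shannon entropy of that distribution is taken as the migration degree. The function name `migration_degree` and the choice of base-2 logarithm are illustrative assumptions.

```python
import math


def migration_degree(word_topic_weights):
    """Shannon entropy (in bits) of a word's distribution over topics.

    word_topic_weights: non-negative weights of one word in each topic
    (an assumed input; how the thesis derives these weights is not stated
    in the abstract). A word confined to one topic scores 0; a word spread
    evenly over K topics scores log2(K).
    """
    total = sum(word_topic_weights)
    if total <= 0:
        raise ValueError("at least one topic weight must be positive")

    entropy = 0.0
    for w in word_topic_weights:
        if w > 0:  # zero-weight topics contribute nothing to the entropy
            q = w / total
            entropy -= q * math.log2(q)
    return entropy


if __name__ == "__main__":
    print(migration_degree([1.0, 0.0, 0.0]))   # word stays in a single topic
    print(migration_degree([0.25, 0.25, 0.25, 0.25]))  # word spread evenly
```

Under this reading, a higher value means the word is shared more evenly among topics, which matches the intuition of a word that has migrated widely.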
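
For the law of diversity, the abstract measures polysemy by the local clustering coefficient of a word's vector in a K-nearest-neighbor graph of word vectors. The sketch below illustrates that computation in plain Python. Several details are assumptions not stated in the abstract: neighbors are ranked by cosine similarity, the graph is made undirected by linking two words if either is among the other's k nearest, and the helper names `cosine`, `knn_graph` and `local_clustering` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_graph(vectors, k):
    """Undirected k-nearest-neighbor graph as {node: set(neighbors)}.

    Assumption: an edge i-j exists if j is among i's k most similar
    vectors or i is among j's.
    """
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def local_clustering(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = adj[node]
    degree = len(neighbors)
    if degree < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (degree * (degree - 1))


if __name__ == "__main__":
    toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    graph = knn_graph(toy_vectors, k=1)
    print({node: local_clustering(graph, node) for node in graph})
```

The toy vectors stand in for Word2vec embeddings; in practice the vectors would come from a trained embedding model.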
Keywords/Search Tags: Topic evolution, Topic model, Word migration, Semantic analysis, Content analysis