
The Design And Implementation Of Chinese Social Science Thesis Analysis System Based On BERT And Citation-LDA

Posted on: 2021-05-12  Degree: Master  Type: Thesis
Country: China  Candidate: R Q Jiang  Full Text: PDF
GTID: 2428330647450841  Subject: Engineering
Abstract/Summary:
Thesis libraries are growing rapidly and new research fields keep emerging, so it is important to understand how research topics evolve over time. A dedicated system is therefore needed to mine the academic change and knowledge-flow network within a research field, and to discover emerging research trends and milestone theses with significant influence. At present, most domestic analysis of social science theses relies on statistical analysis, combining quantitative and qualitative methods, applied to independent data sets. Without correlation analysis across data sets, it is difficult to obtain effective and intuitive results. In addition, unlike English, whose basic unit is the word, Chinese text has no explicit word boundaries, so processing Chinese vocabulary is an important problem the system must solve.

This thesis examines the analysis of social science theses in detail and describes their data characteristics: they span many fields, and each field has a large specialized vocabulary. As a result, simple data statistics cannot reveal information such as topic evolution. The system proposes a text preprocessing method tailored to these characteristics, comprising data format conversion, word segmentation, stop-word removal, and the construction of specialized dictionaries for each field.

BERT (Bidirectional Encoder Representations from Transformers) and Citation-LDA (Citation Latent Dirichlet Allocation) are well suited to generating sentence vectors and performing topic clustering, respectively. Unlike ordinary content-based LDA models, Citation-LDA uses citation information, which can greatly reduce computational complexity; and because it uses co-citation information, it can find the milestone theses that best represent each topic. The data analysis module is therefore based mainly on a text representation model and a topic model. By analyzing the bibliographic information and body text of social science theses, it mines core theses and information related to academic change. To this end, the system uses models such as Word2Vec and BERT to generate vectors that support topic clustering, uses LDA and Citation-LDA to analyze topic dependencies and topic evolution patterns, and integrates this module into the overall system.

The system also provides document retrieval, model management, and other functions, and includes a word segmentation training module that helps domain professionals build specialized dictionaries for each field. The front end uses the Vue.js and Bootstrap frameworks to visualize the analysis results; the back end uses the Spring Boot framework to implement data management and model management services; the data analysis models and scripts are written in Python; and the data is stored in Elasticsearch, a distributed engine, for convenient storage and search. The data set consists of a citation database provided by the Data Center of Nanjing University and full-text PDF files of theses in social science fields such as economics. In this system, I was responsible for back-end coding, model building, and data training.
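
The preprocessing steps named in the abstract (word segmentation with a specialized dictionary, followed by stop-word removal) can be illustrated with a minimal sketch. The abstract does not say which segmentation algorithm the system uses, so the forward maximum matching shown here is an assumption chosen for simplicity, and the names `segment`, `preprocess`, `domain_dictionary`, and `stop_words`, as well as the sample dictionary entries, are hypothetical.

```python
def segment(text, dictionary):
    """Dictionary-based forward maximum matching (an assumed technique,
    not necessarily the one used in the thesis).

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def preprocess(text, dictionary, stop_words):
    """Segment the text, then drop stop words and whitespace tokens."""
    return [
        token
        for token in segment(text, dictionary)
        if token not in stop_words and not token.isspace()
    ]


# Hypothetical specialized dictionary and stop-word list for illustration.
domain_dictionary = {"社会科学", "社会", "科学", "论文", "主题", "演化"}
stop_words = {"的"}

tokens = preprocess("社会科学论文的主题演化", domain_dictionary, stop_words)
print(tokens)
```

Because the longest match wins, the specialized term "社会科学" is kept as one token instead of being split into "社会" and "科学", which is the kind of effect a field-specific dictionary is meant to achieve.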
Keywords/Search Tags: Chinese Social Science Thesis, LDA, Topic Classification, Academic Change