
Design And Implementation Of HelloPaper: An Automatic System For Document Analysis

Posted on: 2022-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Long
GTID: 2518306737453344
Subject: Applied Statistics
Abstract/Summary:
Collecting, collating and analyzing documents is the first step of much scientific research. The rapid growth in the number of documents and the large number of research groups create an urgent demand for document analysis. To meet the need for large-scale document analysis, this paper builds Hello Paper, an automatic document analysis system based on the Chinese and foreign documents collected by CNKI. After a user enters search criteria, Hello Paper automatically collects, collates and analyzes the matching documents and presents the user with a document analysis report containing figures and text.

This paper first surveys the state of research on document analysis systems. Although there is much research on such systems in China and abroad, we found no automatic document analysis system that integrates the whole process of document collection, collation and analysis. Therefore, starting from the core function of the system, document analysis, we design a document analysis framework consisting of macro analysis and micro analysis. Macro analysis helps users understand the general state of the research, and micro analysis helps users explore the research content. On the basis of this framework, and to meet the needs of micro analysis, we design a document recommendation mechanism that combines quantitative and random recommendation: recommendations are made both according to quantitative indicators and at random. For the recommendations based on quantitative indicators, we design a document quality evaluation index system that jointly considers the influence of the document itself, its authors, its journal, its references and its citations, and we justify the index system using the design principles of index systems and the Pearson correlation coefficient.

Based on these designs, we present the architecture of Hello Paper. Hello Paper is implemented in Python and consists of a crawler module, a preprocessing module, an analysis module, a graphical interface module and a log module. The crawler module acquires the document data. The preprocessing module preprocesses the document data. The analysis module analyzes the preprocessed data. The graphical interface module interacts with the user: it receives the retrieval conditions the user enters and reports the results of the run. The log module records the execution of the crawler, preprocessing, analysis and graphical interface modules, so that the running status of the program can be monitored and problems can be found. In the analysis module, we apply statistical methods such as index evaluation, statistical graphics and clustering to the macro and micro analysis of documents.

Finally, using the document analysis report and document data provided by Hello Paper, this paper describes the state of research on and application of web crawlers, an important data acquisition technology, in China.
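
The recommendation mechanism described above, which combines quantitative and random recommendation, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the `Document` class, the single `quality_score` field (standing in for the output of the quality evaluation index system) and the `recommend` function are all hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    # Hypothetical composite score; assumed to come from some
    # quality evaluation index system, higher meaning better.
    quality_score: float


def recommend(documents, n_ranked, n_random, seed=None):
    """Return the n_ranked highest-scoring documents, followed by
    up to n_random documents drawn at random from the remainder."""
    rng = random.Random(seed)
    # Quantitative part: order by score, best first.
    ordered = sorted(documents, key=lambda d: d.quality_score, reverse=True)
    ranked = ordered[:n_ranked]
    # Random part: sample without replacement from what is left,
    # so no document is recommended twice.
    remainder = ordered[n_ranked:]
    sampled = rng.sample(remainder, min(n_random, len(remainder)))
    return ranked + sampled


if __name__ == "__main__":
    docs = [Document(f"paper {i}", float(i)) for i in range(10)]
    for doc in recommend(docs, n_ranked=3, n_random=2, seed=0):
        print(doc.title, doc.quality_score)
```

Sampling the random part only from documents outside the ranked list is one possible design choice; it keeps the two kinds of recommendation disjoint, so the random picks can surface documents that the quantitative indicators would otherwise miss.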
Keywords/Search Tags: document analysis, document analysis system, Python, web crawler, document recommendation