| With the rapid development of chemical science and technology, the volume of chemical literature has grown by leaps and bounds. Reading such a large body of literature requires considerable time and energy, so obtaining valuable information from chemical literature more efficiently has become a focus of attention for readers. Improving reading efficiency and saving time are the key issues when dealing with chemical literature. To address these problems, this paper combines natural language processing with the task of extracting key information from chemical literature and proposes a sequence-to-sequence text summarization model based on the ALBERT pre-trained model, which aims to improve the efficiency of obtaining information from chemical literature and reduce the time spent reading documents. The main research includes the following parts: 1. A method is provided that introduces the Attention mechanism on top of the ALBERT pre-trained model to optimize the word vector features of the input text. Chemical literature contains many chemical formulas and English references to compounds. The method uses chemical entity recognition to obtain the chemical formulas and English references in chemical literature, uses the Attention mechanism to calculate their corresponding attention word vector features, and weights these features with the word vector features generated by the ALBERT model. The word vectors are thereby further optimized, and the vector dimensions corresponding to chemical formulas and English references are increased, so that the vectors obtained through ALBERT encoding better represent literature in the field of chemical engineering. Experimental results show that this method achieves good results. 2. A combined model of ALBERT, Seq2Seq and Attention is proposed to obtain the key information of chemical literature. By using the ALBERT pre-trained model to encode the preprocessed text, the semantic features of the input text
can be learned. Because the length of the text in the literature varies, the model integrates the Seq2Seq model, which makes it possible to process variable-length text. Given the special nature of literature in the field of chemical engineering, the acquisition of key information is divided into three stages. The first stage is document preprocessing: the Python language is used to parse the document, obtain its text content, and filter out stop words, punctuation marks and other content irrelevant to the document; the second stage takes the chemical literature and uses the ALBERT pre-trained model with a multi-layer stacked attention model to obtain word vector features related to the context of the input text. At the same time, for the chemical formulas and English references in chemical literature, chemical entity recognition is used to identify entities such as chemical substances and chemical reactions, and the weights of the word vectors corresponding to these entities are increased; in the third stage, the key information acquisition model performs deep semantic feature recognition on the sentences in the text that contain chemical entities and generates a concise summary closely related to the content of the chemical literature. The experimental results show that the model constructed in this paper extracts the key information in chemical literature well. 3. Based on the proposed model, a key information acquisition system for chemical literature is designed and implemented. After the system requirements are fully analyzed, the overall design of the system is carried out. The system uses a B/S architecture and adopts the standard MVC design pattern. The view layer mainly includes the document input function module, the abstract output function module and the output content selection module, which allow the user to understand the key
information of the document through the interface. Finally, the detailed design and functional description of each module of the system are presented. |
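The entity-weighting idea described in part 1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the regular expression is a crude, hypothetical stand-in for chemical entity recognition, the toy two-dimensional vectors stand in for ALBERT output, and the fusion rule (adding `alpha` times the attention feature to entity tokens) is an assumption chosen only to show how an attention feature might be weighted together with the original word vector.

```python
import math
import re

# Crude, hypothetical stand-in for chemical entity recognition: flags tokens
# shaped like simple formulas (H2O, NaCl). A real system would use a trained model.
FORMULA = re.compile(r"^(?:[A-Z][a-z]?\d*){2,}$")


def entity_mask(tokens):
    """Return True for each token that looks like a chemical formula."""
    return [bool(FORMULA.match(t)) for t in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_attention(query, entity_vectors):
    """Scaled dot-product attention of one query over the entity vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale
              for key in entity_vectors]
    weights = softmax(scores)
    return [sum(w * vec[d] for w, vec in zip(weights, entity_vectors))
            for d in range(len(query))]


def fuse(vectors, mask, alpha=0.5):
    """Add alpha * attention feature to entity tokens; leave others unchanged.

    The additive fusion and the value of alpha are assumptions for illustration.
    """
    entity_vectors = [v for v, m in zip(vectors, mask) if m]
    fused = []
    for vec, is_entity in zip(vectors, mask):
        if is_entity:
            context = entity_attention(vec, entity_vectors)
            fused.append([x + alpha * c for x, c in zip(vec, context)])
        else:
            fused.append(list(vec))
    return fused


if __name__ == "__main__":
    tokens = ["The", "H2O", "and", "NaCl", "mix"]
    # Toy vectors standing in for ALBERT token embeddings.
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    print(fuse(vectors, entity_mask(tokens)))
```

In this sketch only the tokens flagged as chemical entities receive an extra attention-derived component, so their vectors are emphasized relative to ordinary words while the remaining vectors pass through unchanged.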