Research On Audit Text Classification Based On XLNet

Posted on: 2024-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Guo
GTID: 2568307145989359
Subject: Computer technology
Abstract/Summary:
As artificial intelligence technology continues to develop, more and more companies and departments are using it to help with text classification and information management. Auditing is a task that requires a high degree of accuracy and reliability, and applying AI technology to audit text classification can improve the efficiency and accuracy of the auditing process. The application of AI and natural language processing technology in the field of auditing is still in its infancy. To address this, this paper uses an improved XLNet model to classify audit text, with the aim of improving audit efficiency.

First, because audit texts are highly specialised and contain many proper nouns, this paper constructs a corpus of financial and audit texts for pre-training the model. When processing Chinese audit texts, the XLNet model suffers from an OOV (Out of Vocabulary) problem caused by its long segment length, incomplete vocabulary coverage and large vocabulary size. To address this, this paper proposes an XLNet model based on Chinese word segmentation (CWSXLNet, Chinese Word Segmentation XLNet), which reduces the segmentation granularity of the XLNet model to the character level. To address the loss of word information that this change causes, this paper presents an in-depth analysis of the XLNet model and proposes a method that uses a word segmentation mask matrix. When the CWSXLNet model computes two-stream self-attention scores, the word segmentation mask matrix enhances the attention between different characters of the same word for non-masked characters; for masked characters, it reduces the degree of masking, thus enhancing the information shared between characters of the same word. The experimental results show that the CWSXLNet model outperforms the XLNet model on both the public dataset and the audit text classification dataset.

Second, to address the poor classification performance caused by long text at the audit document level, this paper proposes a double-ended self-attention pooling method for semantic extraction of long text, which compresses the text length. The method first divides the long text into several text fragments using a sliding window mechanism, then obtains the semantic vector of each fragment from a pre-trained model. Finally, the double-ended self-attention vectors are pooled with the first and last semantic vectors to compress the long text into a semantic vector whose length is three times that of the pre-trained model's semantic vector. Meanwhile, to further enhance the model's capability for audit text classification tasks, this paper combines the CWSXLNet model with the double-ended self-attention pooling mechanism, introduces a BiGRU (bi-directional gated recurrent unit) network structure as well as a downstream attention mechanism, and proposes an audit text classification model based on the CWSXLNet-DE-BiGRU-Att structure. Experimental results show that the model achieves good performance in long text classification tasks.

To verify the effectiveness and generality of the proposed models and methods, experiments were conducted on both the public dataset and the audit document classification dataset. The experimental results show that the CWSXLNet-DE-BiGRU-Att model performs best in the long text classification task and the CWSXLNet model performs second best, both outperforming other models such as ALBERT, BERT and XLNet.

Finally, in conjunction with the research in this thesis, an automatic audit document classification module has been developed for the Audit Information Security Search Engine System. The module invokes the CWSXLNet-DE-BiGRU-Att model to classify audit documents as they are entered into the system, which facilitates the organisation and retrieval of audit documents and improves the efficiency of auditors.
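
The abstract does not give the formulas behind the long-text compression step, so the following is only a minimal sketch of one plausible reading of it, not the thesis implementation. It assumes that the pooled result is the concatenation of the first fragment vector, an attention-weighted average of all fragment vectors, and the last fragment vector, which yields a vector three times the encoder's output size. The names `sliding_windows`, `toy_encoder` and `double_ended_attention_pool`, the window and stride values, and the fixed `query` vector are all hypothetical; in a real model the encoder would be the pre-trained network and the query would be learned.

```python
import numpy as np


def sliding_windows(tokens, window, stride):
    """Split a token sequence into overlapping fragments of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    starts = list(range(0, len(tokens) - window + 1, stride))
    # Add one final window so the tail of the text is always covered.
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)
    return [list(tokens[s:s + window]) for s in starts]


def toy_encoder(fragment, hidden=8):
    """Hypothetical stand-in for the pre-trained model: one vector per fragment."""
    seed = sum(ord(ch) for tok in fragment for ch in tok)
    return np.random.default_rng(seed).standard_normal(hidden)


def double_ended_attention_pool(fragment_vectors, query):
    """Return [first vector, attention-pooled vector, last vector] concatenated.

    This layout is an assumption made for illustration only.
    """
    H = np.asarray(fragment_vectors, dtype=float)  # shape (n_fragments, hidden)
    scores = H @ query / np.sqrt(H.shape[1])       # one score per fragment
    weights = np.exp(scores - scores.max())        # numerically stable softmax
    weights /= weights.sum()
    pooled = weights @ H                           # weighted average, shape (hidden,)
    return np.concatenate([H[0], pooled, H[-1]])   # shape (3 * hidden,)


# Usage: 11 character-level tokens, window of 4, stride of 3.
tokens = list("abcdefghijk")
fragments = sliding_windows(tokens, window=4, stride=3)
H = np.stack([toy_encoder(f) for f in fragments])
query = np.ones(H.shape[1])  # assumption: a learned parameter in practice
doc_vec = double_ended_attention_pool(H, query)
print(len(fragments), doc_vec.shape)  # 4 (24,)
```

Whatever the number of fragments, the output length stays fixed at three times the encoder size, which is what lets a downstream classifier accept documents of arbitrary length.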
Keywords/Search Tags:XLNet, Text classification, Intelligent auditing, Attention mechanism, BiGRU