| With the rapid development of the Internet and the spread of smartphones, people have increasingly diverse ways of obtaining information. While the Internet has brought convenience, it has also created opportunities for criminals, who mix harmful content into the mass of online information and pose significant threats to people’s property and to social stability. Fraudulent texts of many kinds are widespread in text messages and on social media platforms, and they are updated frequently to evade network supervision through intentional misspellings and the substitution of different words for fraud-related ones. Traditional fraudulent text recognition methods cannot adapt dynamically to these changes, which results in low accuracy on new types of fraud and new forms of text obfuscation, while high-accuracy algorithms suffer from slow inference. To address these problems, this thesis analyzes the misspelling substitutions in fraudulent texts and conducts in-depth research on spelling correction, type recognition, and model compression. The main contributions and innovations of this thesis are as follows: 1. This thesis constructs a new Chinese fraudulent text dataset. Since existing Chinese fraudulent text datasets are relatively outdated and differ significantly from current forms of fraud, this thesis supplements public datasets with data obtained from real scenarios and annotates the data for three tasks: spelling correction, binary classification (fraud or not), and multi-class intention recognition. This dataset is more realistic than previous public datasets and helps researchers analyze Chinese fraudulent texts more comprehensively. 2. This thesis proposes a Chinese spelling correction algorithm based on gated feature fusion. The algorithm detects and corrects misspellings in fraudulent texts and in texts from other domains. It uses gate networks to selectively fuse the semantic, pronunciation, and glyph information of Chinese characters, improving the model’s ability to correct errors and to interpret them. Experiments show that the algorithm achieves better correction performance than the baseline models on both the fraudulent text dataset and the SIGHAN dataset. Ablation experiments verify the effectiveness of each module, and an analysis of typical cases verifies that the algorithm can effectively explain the causes of errors. 3. This thesis proposes a fraudulent text recognition algorithm based on prompts and spelling checking. Building on the spelling correction model, the algorithm uses prompt learning to align the fraudulent text classification task with the spelling correction task, optimizing both tasks together and avoiding the need for an additional classifier. The algorithm deals effectively with obfuscated texts and new types of fraud. In experiments against the baseline models, it shows excellent performance on the fraudulent text recognition task. An analysis of the model’s attention weights verifies that the algorithm attends to suspicious words during prediction. 4. This thesis designs a lightweight scheme for the fraudulent text recognition model based on knowledge distillation. The method compresses the model by distilling knowledge from four sources: gate vectors, hidden layer outputs, attention matrices, and backbone outputs. In comparative experiments, the distilled student model outperforms the BERT-based fraudulent text recognition model while having about one-fifth as many parameters. Ablation experiments verify the effectiveness of each part of the loss function used in the knowledge distillation method. |
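
The gated feature fusion named in contribution 2 can be pictured with a small sketch. The abstract does not give the gate equations, so everything here is an assumption made for illustration: one scalar sigmoid gate per modality, computed from the concatenation of the semantic, pronunciation, and glyph vectors of a single character, followed by a gated sum. The function name `gated_fusion` and its parameters are hypothetical and are not taken from the thesis.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gated_fusion(semantic, phonetic, glyph, gate_weights, gate_biases):
    """Fuse three feature vectors of one character (illustrative only).

    Assumption: each modality has a scalar gate that reads the concatenation
    of all three vectors and scales that modality before the vectors are summed.
    """
    concat = semantic + phonetic + glyph  # list concatenation
    features = {"semantic": semantic, "phonetic": phonetic, "glyph": glyph}
    gates = {
        name: sigmoid(dot(gate_weights[name], concat) + gate_biases[name])
        for name in features
    }
    dim = len(semantic)
    fused = [sum(gates[n] * features[n][i] for n in features) for i in range(dim)]
    return fused, gates


# Toy 2-dimensional features for a single character (made-up numbers).
semantic = [0.2, -0.1]
phonetic = [0.9, 0.4]
glyph = [0.0, 0.3]

# With zero weights and biases every gate is sigmoid(0) = 0.5.
zero_w = {name: [0.0] * 6 for name in ("semantic", "phonetic", "glyph")}
zero_b = {name: 0.0 for name in zero_w}
fused, gates = gated_fusion(semantic, phonetic, glyph, zero_w, zero_b)
print(fused, gates)
```

In a sketch like this, the gate values offer one way to read off which kind of evidence dominated for a character: a high `phonetic` gate on a corrected character would suggest a sound-alike substitution, and a high `glyph` gate a look-alike one. Whether the thesis derives its error explanations this way is not stated in the abstract.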