Research On Computer Virus Signature Automatic Extraction Technique

Posted on:2022-04-27

Degree:Master

Type:Thesis

Country:China

Candidate:W Li

GTID:2518306575474134

Subject:Computer technology

Abstract/Summary:

With the rapid development of China’s mobile network industry and big data technology, Internet applications of all kinds have emerged in an endless stream, driving rapid growth in the number of Internet users and the accumulation of a massive amount of unstructured Chinese text data. This text contains rich semantic information and knowledge, but it also contains noise caused by human input errors, which lowers the average quality of documents and indirectly reduces the value of mining the data. Because of the scale of the data, traditional manual quality assessment and error correction are extremely time-consuming. How to assess text quality and correct text errors automatically and efficiently is therefore a research question that both industry and academia follow closely, and Chinese text correction has received growing attention from artificial intelligence researchers in recent years. Taking into account the distinctive syntactic structure of Chinese and the characteristics of text error correction in the financial domain, this thesis proposes a Chinese text correction method based on language models. The work consists of three parts.

First, preprocessing of Chinese text data. To improve the quality and utilization of the training data, the distribution of text error correction data in the financial vertical domain is analyzed comprehensively, and data filtering and data augmentation are carried out. A domain confusion dictionary is derived from the training data in order to introduce vertical-domain knowledge. The financial-domain corpus is preprocessed and then used to train an N-gram language model and to fine-tune the pre-trained BERT language model so that both learn domain knowledge.

Second, text error detection. Deep learning methods are used to detect the anomalous positions in the text that need correction, and comparative experiments are conducted. A new text error detection model, WL-BERT, is proposed by combining Word2Vec, a bidirectional LSTM, and the BERT language model. By combining word vectors that carry different semantics, the model classifies each character in the sentence to locate the erroneous positions. It achieved the best result among the models compared in this thesis, with an error detection F1 score of 0.849.

Third, text error correction. The mainstream traditional approach is the N-gram based error correction model, but its ability to model long-distance dependencies and to understand context is poor. Traditional deep learning models, in turn, cannot easily incorporate vertical-domain knowledge. This thesis therefore integrates the N-gram language model with the BERT language model. The N-gram model is first applied at the anomalous positions reported by the error detection module: a set of correction candidates is obtained from the domain confusion dictionary, the perplexity (confusion degree) of the resulting candidate sentences is compared, and the correction is applied. The BERT language model is then used for correction. Owing to the strong context understanding of the attention mechanism and the unsupervised learning of the pre-trained language model, BERT can propose a contextual set of correction candidates for each position, which is passed through a similarity-confidence filter module to perform the correction.

The experimental results show that the proposed model gives the best text correction performance, with an F1 score of up to 0.824, about 4 points higher than the error correction result of the BERT model alone. The model integration method makes use of the strong learning and generalization ability of deep learning and also introduces the regularized confusion dictionary, combining the advantages of the different methods to achieve better Chinese text error correction.
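
The N-gram correction step described in the abstract (candidates drawn from a confusion dictionary, then ranked by perplexity) can be illustrated with a minimal sketch. This is not the thesis's implementation: the character-level bigram model, the add-one smoothing, the function names `train_bigram`, `perplexity` and `correct`, and the toy corpus and dictionary are all assumptions made for illustration only.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def perplexity(sent, unigrams, bigrams):
    """Perplexity of a sentence under an add-one smoothed bigram model."""
    chars = ["<s>"] + list(sent) + ["</s>"]
    vocab = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        # Add-one smoothing is an assumption; the thesis does not state one.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


def correct(sent, pos, confusion, unigrams, bigrams):
    """Replace the character at a detected position with the confusion-dictionary
    candidate that yields the lowest sentence perplexity (or keep the original)."""
    candidates = [sent] + [
        sent[:pos] + cand + sent[pos + 1:]
        for cand in confusion.get(sent[pos], [])
    ]
    return min(candidates, key=lambda s: perplexity(s, unigrams, bigrams))


if __name__ == "__main__":
    # Toy, made-up data standing in for a financial-domain corpus and dictionary.
    corpus = ["银行利率上调", "银行存款利率", "利率下调"]
    confusion = {"律": ["率", "绿"]}
    unigrams, bigrams = train_bigram(corpus)
    # Position 3 plays the role of the output of an error detection module.
    print(correct("银行利律上调", 3, confusion, unigrams, bigrams))
```

Because every candidate has the same length as the input, comparing perplexities here is equivalent to comparing sentence probabilities. A real system would use a higher-order model trained on a large domain corpus, and, as the abstract describes, would combine this step with BERT-generated candidates.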

Keywords/Search Tags:

Chinese text error correction, Pre-training language model, Attention mechanism, Statistical language model, Model integration

Related items

1	Research On Language Model Rescoring And Error Correction Of Transcription Results In Chinese Speech Transcription
2	Research And Implementation Of Error Detection And Error Correction Efficiency Optimization Of Chinese Text
3	OCR Error Post-correction Based On Chinese Character-level Features And Language Model
4	Research On Error Correction Method Of Chinese Short Text Based On BERT
5	The Design And Implement Of A Mobiles' Chinese Input System Based On Statistical Language Model
6	Research On Chinese Text Summary Generation Based On Pre-trained Language Model
7	Research Of Statistical Language Model N-best Reranking Algorithm
8	Research On Sentiment Analysis Of Self-attention Mechanism Based On Pre-trained Language Model
9	Research On Error Correction Of News Text Based On Masked Language Model
10	Research On Language Model Corpus Expansion And Text Error Correction Algorithm For Speech Transcription