| In text content moderation, reviewers spend a great deal of time checking comments and related text to ensure that they contain nothing illegal. Many websites and forums maintain lists of prohibited words that support a degree of automatic review, but such lists alone are not enough. In dangerous activities such as the drug trade in particular, dealers replace prohibited words with seemingly harmless terms to evade review; for example, they use "ice" for methamphetamine and "leaves" for marijuana. It is therefore necessary to perform both term discovery, to find new terms promptly, and term identification, to determine whether a term carries the meaning of a prohibited word in a specific context. At present, approaches based on prohibited-word lists depend on domain experts to keep searching for new terms and to identify terms manually. This has two problems: high labor cost and difficulty in tracking new terms and their usage in a timely manner. To reduce the burden on reviewers and improve the timeliness of term discovery, this paper studies how to automate term discovery and identification for drug transactions, as follows. First, this paper proposes a new process for labeling data automatically. Although many corpora exist, annotated data relevant to this study remain scarce. This paper therefore analyzes several data sources from previous research and from the web, extracts content related to the drug trade as positive samples, and uses text from ordinary discussions as negative samples to build a term identification dataset, which lays the foundation for the subsequent experiments. While building the dataset, this paper cleans the data and extracts the main content, which addresses the noise and excessive length of the raw data and thereby improves the efficiency of model training. In addition, this paper randomly shuffles the labeled samples so that features shared by data from the same source do not interfere with subsequent experiments. This work supports follow-up research. Second, this paper proposes a semi-supervised scheme for term identification that addresses the sample imbalance in the data. Inspection of the data revealed both class imbalance and a large amount of unlabeled data. Drawing on recent related research, this paper pseudo-labels the unlabeled data in a semi-supervised manner and then preferentially selects high-confidence samples from the minority classes, which keeps confidence high while mitigating the imbalance. This paper improves on that scheme by making the sampling threshold-based and by adding penalty and decay terms to optimize the sampling results. Experimental results show that with these improvements the method outperforms the original method on this dataset and is applicable to multiple models. Finally, this paper proposes a term discovery method based on a masked language model (MLM) with word vectors as a post-processing step. Specifically, this paper first performs targeted mask training on the relevant data so that the model better recognizes the corresponding contexts, and uses the MLM to determine candidate terms. Word vectors are then trained on different corpora and aligned using the Procrustes method. The cosine distance between each candidate term's word vectors in the different corpora is computed to determine whether the term may carry another meaning, and the results are filtered accordingly. This method significantly improves on using the MLM mask method alone without adding much computational cost, and has good application value in term discovery. |
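
The word-vector post-processing step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the vocabulary, the toy embeddings, the choice of anchor words, the 0.5 threshold, and the function names `fit_rotation` and `cosine_distance` are all assumptions made for illustration. The sketch fits an orthogonal Procrustes rotation on words assumed to keep their meaning, applies it to the domain-corpus vectors, and flags words whose cosine distance to their general-corpus vector is large.

```python
import numpy as np


def fit_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def cosine_distance(a, b):
    """Row-wise cosine distance between two matrices of equal shape."""
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - dot / norms


# Toy data (hypothetical): 6 words with 4-dimensional vectors.
rng = np.random.default_rng(0)
vocab = ["price", "ship", "order", "ice", "leaves", "pay"]
general = rng.normal(size=(len(vocab), 4))   # vectors from a general corpus

# The domain space is the general space under an unknown orthogonal map ...
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
domain = general @ q
# ... except for two words whose usage differs; the shift is simulated
# here by flipping their vectors.
for word in ("ice", "leaves"):
    domain[vocab.index(word)] *= -1.0

# Fit the rotation on anchor words assumed to mean the same in both corpora.
anchors = [vocab.index(w) for w in ("price", "ship", "order", "pay")]
rotation = fit_rotation(domain[anchors], general[anchors])
aligned = domain @ rotation

# Words far from their general-corpus vector may carry another meaning.
distances = cosine_distance(aligned, general)
threshold = 0.5                              # assumed cut-off
flagged = [w for w, d in zip(vocab, distances) if d > threshold]
print(flagged)
```

In practice the vectors would come from embeddings trained separately on each corpus, and both the anchor set and the threshold would need tuning on held-out data.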