Domain-Oriented Text Entity Recognition and Association Discovery for Multi-Source Data

Posted on: 2022-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: P Peng
GTID: 2518306338468684
Subject: Computer technology
Abstract/Summary:
Emergency management is closely tied to national security and social stability. Using artificial intelligence to automatically discover and identify valuable information in multi-source domain text, such as core domain entities and their semantic associations, offers important guidance for emergency early warning and response. Entity recognition and entity association discovery are the key technologies for automatic domain information extraction. However, they still face the following challenges. First, Chinese text lacks explicit word boundaries and has complex morphology and grammar, which makes word-level features hard to exploit and limits entity recognition performance. Second, because annotated domain corpora are scarce and manual annotation is expensive, mainstream models cannot be adequately trained with fully supervised methods. Finally, the sparsity and specificity of domain information lead to low efficiency and accuracy in entity association discovery. To address these problems, this thesis makes the following improvements.

(1) To address the difficulty of exploiting Chinese word information, an entity recognition model based on adaptive word combination is proposed. A convolutional network first captures character-window information, and multi-attention is then computed against potential words to adaptively combine Chinese word information. Large-scale prior knowledge is introduced by combining the model with a Chinese pre-trained model. The model is evaluated on Resume NER and Weibo NER: compared with the best baseline model, FLAT, the F1 score improves by 0.32% and 0.69%, respectively. Introducing the Chinese pre-trained model RoBERTa-wwm improves recognition further.

(2) To address the high cost of manually annotating domain entity sequences, a domain entity recognition and expansion framework based on distant (remote) supervision is proposed. The framework builds its training set through dictionary-based distant-supervision annotation and uses a PU (positive-unlabeled) learning algorithm to train CWAI-R, the base model proposed in this thesis, which greatly reduces the cost of manual annotation. It also introduces teacher-student self-training to generalize the learned semantics. Domain entity recognition is verified on the person name, organization name, and professional title entities of the Resume NER evaluation set: the F1 score is only 2.2% lower than that of a fully supervised baseline trained on manual sequence annotations, which shows that good recognition performance is achievable without manual annotation. In addition, weapon entities are expanded and verified on the emergency management domain dataset, with an entity expansion rate of 107.4% and an expansion accuracy of 81.3%.

(3) To handle the heavy non-domain noise and strong domain specificity of the collected data, this thesis proposes a multi-source entity association discovery framework based on RoBERTa-wwm. The first part of the framework is a fastText-based domain semantic discriminator that quickly pre-screens non-domain noise text. The second part is an entity association discovery model based on RoBERTa-wwm: it computes character-level semantics adaptively with RoBERTa-wwm, extracts window information with a convolutional neural network, and then infers entity associations from entity information, inter-entity semantic dependency information, and global semantic information. Comparative experiments on the emergency management domain dataset constructed in this thesis show that the F1 score of the domain semantic discriminator is only 0.4% lower than that of the best baseline model, BERT-CLS, while its discrimination speed is nearly 3000 times that of the baseline in the same CPU environment. Even when the baseline is moved to a GPU environment, its speed is still less than 1/500 of the discriminator's. The RoBERTa-wwm-based entity association discovery model also outperforms the baseline model.

Finally, based on the above improvements, an emergency management domain information extraction system is developed that automates information collection, domain information discrimination, entity recognition, and entity association discovery. Test results show that the system achieves a high degree of automation and analysis accuracy and meets the requirements of the field.
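
The dictionary-based annotation step in (2) can be illustrated with a minimal sketch. The abstract does not say how dictionary entries are matched against text, so the greedy longest-match strategy, the function name `distant_label`, the BIO tag scheme, and the toy lexicon below are all assumptions made for illustration, not the thesis's implementation.

```python
from typing import Dict, List


def distant_label(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Tag each character with a BIO label by greedy longest-match against a lexicon.

    `lexicon` maps an entity surface form to its type (hypothetical example data).
    """
    tags = ["O"] * len(text)
    max_len = max((len(term) for term in lexicon), default=0)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate span first, so a longer entry
        # takes priority over a shorter entry that is its prefix.
        for span in range(min(max_len, len(text) - i), 0, -1):
            entity_type = lexicon.get(text[i:i + span])
            if entity_type is not None:
                tags[i] = f"B-{entity_type}"
                for j in range(i + 1, i + span):
                    tags[j] = f"I-{entity_type}"
                i += span
                matched = True
                break
        if not matched:
            # Unmatched characters keep the "O" tag. In a PU-learning setup
            # these would be treated as unlabeled rather than as true negatives.
            i += 1
    return tags


# Toy lexicon and sentence (assumed for illustration only).
lexicon = {"张三": "NAME", "北京": "LOC", "北京大学": "ORG"}
for char, tag in zip("张三就职于北京大学", distant_label("张三就职于北京大学", lexicon)):
    print(char, tag)
```

Because a dictionary never covers every entity, labels produced this way are incomplete. That is the gap PU learning and teacher-student self-training are meant to address in the framework described above.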
Keywords/Search Tags: entity recognition, remote supervision, association discovery, pre-trained model, emergency safety management