Research On Product Entity Recognition And Domain Transfer Based On Deep Learning And Remote Supervision

Posted on: 2021-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: L J Bian
GTID: 2518306302476194
Subject: Financial Information Engineering
Abstract/Summary:
With the advent of the Internet era, a large amount of text data has accumulated, and extracting important information from large volumes of unstructured text is the main purpose of information extraction. Named entity recognition (NER) is a sub-task of information extraction and plays an important role in natural language processing tasks such as relation extraction, knowledge graph construction, and knowledge-based question answering. The financial industry holds a large amount of textual data, such as IPO texts and company annual reports. If we can obtain the main products of listed companies, we can assist in analyzing a product's sales through the industry in which the product is located and through the sales of its upstream and downstream products. Extracting the products a company produces from a large amount of text is a named entity recognition task. The development of deep neural networks frees people from time-consuming feature extraction and provides a basis for research on named entities, so named entity recognition has attracted growing attention from researchers.

When labeled data exist in a specific industry, named entity recognition models mainly focus on how to learn better from the labeled data. When there is no labeled data, they focus on how to use existing dictionaries or knowledge graphs to produce high-quality labeled data for neural network learning. When one industry has a large amount of labeled data and another has none, the model needs to focus on how to increase its applicability across industries. This thesis uses the IPO texts of listed companies in the financial industry and Baidu Encyclopedia data to explore these three situations, with the aim of extracting high-quality product names of listed companies.

Product recognition within a specific industry is a supervised task, in which recognition is improved mainly by optimizing the network structure and improving the semantic representation. Network structure optimization starts from the traditional BiLSTM + CRF model and adds Lattice BiLSTM + CRF, which incorporates word-granularity information and raises the F1 score of named entity recognition. Within Lattice BiLSTM + CRF, changing the cell update method of the LSTM yields the Lattice Bi-ON-LSTM + CRF model, which can fuse the hierarchical information of natural language and further improves the performance of the product recognition model. Combining the BiLSTM + CRF structure with the encoder structure of the Transformer gives the best model for supervised product recognition. Beyond optimizing the network structure, the thesis also improves recognition performance by fusing different semantic representations, and uses the word vectors published by Tencent to obtain a product recognition model that is superior to one built on ordinary word vectors.

When the IPO text has no labeled data, labels are produced by incorporating dictionary information. Most previous research matched the text against the dictionary to produce labels; the problem is that precision is high but recall is very low. Adding hand-written rules can further improve precision, but still cannot solve the problem of low recall. The thesis uses the fine-grained words obtained by segmenting the dictionary entries as seed words, takes a seed word as the initial name of a product entity during annotation, and then expands it to both sides according to part of speech, because product names generally consist of nouns, adjectives, and the like, and are unlikely to contain prepositions and other parts of speech. Feeding this labeled data into Lattice Bi-ON-LSTM + CRF achieves better recognition results. Although distant (remote) supervision with seed words plus part-of-speech expansion reduces the probability of a complete mismatch in the annotation, the expansion also harms precision. By fusing the supervised data with the distantly supervised data and bringing in the knowledge of the Chinese pre-trained model BERT for correction, the final model outperforms the Transformer BiLSTM + CRF entity recognition model.

Domain transfer for product recognition is studied on the automotive industry and the non-ferrous metal industry. The IPO data of the automotive industry has supervised labels and serves as the source domain, while the IPO data of the non-ferrous metal industry has no labels. Removing the average word vector of the automotive industry and the average word vector of its unique verbs removes the features specific to the automotive industry, so that the product recognition model trained on the automotive industry can be transferred to the non-ferrous metal industry with better results. Additionally removing the average word vector of non-ferrous metal products means that neither the non-ferrous metal industry nor the automotive industry retains its own industry characteristics, which further narrows the gap between the source and target domains and improves the applicability of product recognition.

The thesis starts from three different situations, focuses on product recognition in the IPO texts of listed companies, and incorporates available knowledge into research on supervised product recognition, distantly supervised product recognition, and the applicability of product recognition models. The entity extraction model can provide ideas for research on non-generic entities in more specialized fields. The product extraction results can help analyze the sales of listed companies' products, can help establish upstream and downstream product chains within an industry, and can also lay the foundation for building domain-specific knowledge bases.
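The seed word plus part-of-speech expansion labeling described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the input is assumed to be already segmented and POS-tagged, the tag names (`"n"`, `"a"`, `"p"`, `"v"`), the `B-PRO`/`I-PRO`/`O` label scheme, and the function name `label_by_seed_expansion` are all hypothetical, and English tokens are used purely for readability.

```python
from typing import List, Set, Tuple

# Assumption: POS tags that may appear inside a product name
# ("n" = noun, "a" = adjective). The thesis does not give its tag set.
EXPANDABLE_POS: Set[str] = {"n", "a"}


def label_by_seed_expansion(
    tokens: List[Tuple[str, str]],
    seed_words: Set[str],
    expandable_pos: Set[str] = EXPANDABLE_POS,
) -> List[str]:
    """Return one BIO tag per (word, pos) token: B-PRO, I-PRO or O."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        word, _ = tokens[i]
        if word not in seed_words:
            i += 1
            continue
        # Start from the seed word and grow the span to both sides while
        # the neighbouring token has an allowed part of speech.
        start = end = i
        while (
            start > 0
            and tags[start - 1] == "O"  # do not merge into an earlier entity
            and tokens[start - 1][1] in expandable_pos
        ):
            start -= 1
        while end + 1 < len(tokens) and tokens[end + 1][1] in expandable_pos:
            end += 1
        tags[start] = "B-PRO"
        for j in range(start + 1, end + 1):
            tags[j] = "I-PRO"
        i = end + 1  # continue scanning after the labeled span
    return tags
```

With the seed word `aluminum`, the span grows left over the adjective and right over the following nouns, and stops at the verb and the preposition. A noun that is not adjacent to any seed word stays unlabeled, which is consistent with the abstract's remark that expansion raises recall at some cost to precision.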
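The domain-transfer step, removing an industry's average word vector, might look like the sketch below. The abstract does not say how the removal is performed, so plain subtraction of the mean vector is an assumption here, and the names `mean_vector` and `remove_domain_mean` as well as the toy two-dimensional embeddings are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def remove_domain_mean(
    embeddings: Dict[str, Vector], domain_words: Sequence[str]
) -> Dict[str, Vector]:
    """Subtract the average vector of `domain_words` from every embedding.

    Assumption: "removing" the industry's average word vector is modeled
    as simple subtraction; the thesis may use a different operation.
    """
    center = mean_vector([embeddings[w] for w in domain_words if w in embeddings])
    return {
        word: [x - c for x, c in zip(vec, center)]
        for word, vec in embeddings.items()
    }
```

After this shift the chosen domain words are centered on the origin, so the direction shared by the industry's vocabulary no longer separates the source domain from the target domain. The same function could be applied a second time with the unique verbs of the industry, or with target-domain product words, as the abstract describes.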
Keywords/Search Tags: NER, Product Recognition, Distant Supervision, Domain Adaptation