
Core Entity Recognition For Web Articles Based On Tree-LSTM Model

Posted on: 2022-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhou
Full Text: PDF
GTID: 2518306566497644
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the mobile Internet and online social media, the explosive growth of online text has made the problem of "information overload" increasingly serious. For much of the information on the Internet it is hard to tell true from false, which raises the cost of obtaining useful information. The core entity is the main object an article describes, or the principal one among the entities the article mentions. Identifying the core entities in online articles helps readers quickly grasp the main content of articles from a large volume of text and obtain useful information in time.

The task is difficult for several reasons. Online articles span many domains and text structures, and the distribution and statistical features of core entities are inconsistent across them, so the semantic features of core entities are difficult to describe clearly. In addition, the boundaries of core entity words in an article are hard to mark, and long entity words exhibit word combination and word nesting, which makes extracting core entities harder. Furthermore, identifying the core entity requires grasping the article's main object of description through comprehension at the paragraph or whole-text level. Motivated by these practical requirements, the thesis studies the effect of long-distance text information on core entity recognition in online articles and makes the following contributions.

(1) The BiLSTM-CRF model is widely used in natural language processing tasks because it can capture long-distance dependencies. In practice, owing to Chinese word-segmentation issues, the BiLSTM-CRF model usually uses character-level rather than word-level embeddings. Character embeddings are not an ideal choice for capturing exact semantic expressions. Therefore, given the characteristics of the task and the word combination and nesting observed in core entities, the thesis proposes a word-level BiLSTM-CRF method for recognizing the core entities of articles.

(2) To address the difficulty BiLSTM-CRF has in capturing long-distance text information in a complex semantic environment, the thesis proposes a method for identifying the core entities of articles based on a Tree-LSTM-CRF model. Using the syntactic dependencies and hierarchical structure of the article, the model constructs a tree-shaped dependency structure for text understanding. By exploiting the bottom-up information propagation and memory capability of Tree-LSTM, the model captures long-distance text information in the article well, which improves core entity recognition. Experimental results show that the F1 value increases by 11.57% compared with the BiLSTM-CRF model.

(3) To address the insufficient interaction between words and text information in the Tree-LSTM-CRF model, the thesis further proposes an Attention-Based Tree-LSTM-CRF model, which introduces hierarchical attention based on text information into the Tree-LSTM-CRF model. Through information interaction between words and sentences, paragraphs, and the article, the model captures the importance of each word to its sentence, its paragraph, and the article as a whole. This increases the interaction between sentences and text information and improves the model's ability to identify core entities. Experimental results show that the F1 value of the improved model increases by 24.58% compared with BiLSTM-CRF and by 11.66% compared with Tree-LSTM-CRF.
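
The bottom-up information propagation described in (2) can be illustrated with a small sketch. The abstract does not say which Tree-LSTM variant the thesis uses, so the code below assumes the commonly used Child-Sum formulation; the names `Node`, `init_params`, and `tree_lstm` are hypothetical, the weights are random rather than trained, and the CRF layer and the attention mechanism are left out. It is meant only to show how a parent node's state is computed from its children's states, not to reproduce the thesis's model.

```python
import numpy as np


class Node:
    """A tree node holding an input vector and a list of child nodes (hypothetical helper)."""

    def __init__(self, x, children=()):
        self.x = np.asarray(x, dtype=float)
        self.children = list(children)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_dim, hidden_dim, seed=0):
    """Random, untrained weights for the input (i), forget (f), output (o) and update (u) gates."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("i", "f", "o", "u"):
        params["W_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        params["U_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params


def tree_lstm(node, p):
    """Return (h, c) for `node`; children are evaluated first (bottom-up).

    Assumption: Child-Sum Tree-LSTM equations, not confirmed by the abstract.
    """
    child_states = [tree_lstm(child, p) for child in node.children]
    hidden_dim = p["b_i"].shape[0]

    # Sum of the children's hidden states (a zero vector for a leaf).
    h_sum = sum((h for h, _ in child_states), np.zeros(hidden_dim))

    i = sigmoid(p["W_i"] @ node.x + p["U_i"] @ h_sum + p["b_i"])
    o = sigmoid(p["W_o"] @ node.x + p["U_o"] @ h_sum + p["b_o"])
    u = np.tanh(p["W_u"] @ node.x + p["U_u"] @ h_sum + p["b_u"])

    # One forget gate per child, so each subtree's memory is kept or dropped separately.
    c = i * u
    for h_k, c_k in child_states:
        f_k = sigmoid(p["W_f"] @ node.x + p["U_f"] @ h_k + p["b_f"])
        c = c + f_k * c_k

    h = o * np.tanh(c)
    return h, c


# Toy tree: an article-level root over two sentence-level nodes, each over word-level leaves.
params = init_params(input_dim=4, hidden_dim=3)
sentence_1 = Node(np.full(4, 0.2), [Node(np.ones(4)), Node(-np.ones(4))])
sentence_2 = Node(np.full(4, -0.3), [Node(np.zeros(4))])
article = Node(np.full(4, 0.5), [sentence_1, sentence_2])
root_h, root_c = tree_lstm(article, params)
```

Because every node's memory cell is assembled from its children's cells, a word deep in the tree can still influence the article-level state, which is the long-distance effect the abstract attributes to Tree-LSTM.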
Keywords/Search Tags: Core entity recognition, Tree-LSTM-CRF model, Attention mechanism