
Research On Chinese Document Matching Algorithm

Posted on: 2021-04-22  Degree: Master  Type: Thesis
Country: China  Candidate: J L Guo  Full Text: PDF
GTID: 2428330611498180  Subject: Computer technology
Abstract/Summary:
Identifying the relationship between two documents is an important task in natural language processing and is widely applied in Internet services such as news recommendation systems and search engines. However, compared with sentence matching or query-document matching in information retrieval, document matching is a more challenging and distinct problem, since documents contain rich semantic information and complicated structure. This thesis focuses on the challenges of document matching and proposes models specifically for Chinese documents. We propose a document matching pipeline that recasts the matching task as an equivalent graph classification problem. The pipeline consists of modeling document pairs through graph representation learning, feature extraction, and graph classification. Two datasets are used to verify the performance of our models. The main algorithms and contributions are as follows:

(1) A document matching pipeline based on graph classification. The procedure is as follows. First, the input document pair is converted into graph-structured data (including vertex selection and adjacency matrix completion). Second, node features are extracted with an algorithm based on graph convolutional neural networks (GCN). Because the naive GCN cannot aggregate information from a node's neighbors with different weights, which is essential for node feature transformation, we employ graph neural networks with a self-attention mechanism to extract semantic features. Finally, we design a graph classification module based on a multi-layer perceptron that fuses the graph structure and node features to produce the matching result.

(2) Graph pooling enhanced document matching models. Graph pooling is a non-trivial step in the graph classification pipeline, since the graph representation is generated by this fusion procedure. To preserve the discriminative ability of the graph representation, which contributes to classification, we propose graph pooling models based on the self-attention mechanism and graph attention to assist the fusion step. In addition, to model local patterns in the node features of the graph, we design a convolutional neural network (CNN) module and a recurrent neural network (RNN) module to fuse the node features.

(3) Input feature enhanced document matching models. We consider input features to be of great importance to model performance. We therefore propose a multi-scale CNN module to encode the text of each vertex, which not only yields node representations with rich and robust semantic information but also introduces more non-linear transformations. To model the interactions between a node and its neighbors at the input stage, we sample a fixed number of neighbors for each graph node to enhance its feature representation. We also design a model that extracts a global feature of the input graph to help the final graph classification. In addition, we study how the proposed modules work together.

Experimental results show that our models achieve state-of-the-art (SOTA) performance on two public datasets. Finally, we discuss future work and research directions for the document matching task.
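
The contrast drawn in contribution (1) between equal-weight GCN aggregation and attention-weighted aggregation, and the attention-based pooling of contribution (2), can be illustrated with a small sketch. This is not the thesis's implementation: the dot-product scoring, the absence of learned weight matrices, the toy graph, and all function names (`mean_aggregate`, `attention_aggregate`, `attention_pool`) are assumptions made for illustration only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neighbors(adj, i):
    # Neighbors of node i, plus i itself (a self-loop), in index order.
    return [j for j in range(len(adj)) if adj[i][j] or j == i]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def mean_aggregate(adj, feats):
    # Simplified GCN-style step: every neighbor gets the same weight.
    out = []
    for i in range(len(adj)):
        nbrs = neighbors(adj, i)
        out.append(weighted_sum([1.0 / len(nbrs)] * len(nbrs),
                                [feats[j] for j in nbrs]))
    return out

def attention_aggregate(adj, feats):
    # Attention-style step: weights come from a softmax over similarity
    # scores. Dot-product scoring is an assumption of this sketch.
    out = []
    for i in range(len(adj)):
        nbrs = neighbors(adj, i)
        weights = softmax([dot(feats[i], feats[j]) for j in nbrs])
        out.append(weighted_sum(weights, [feats[j] for j in nbrs]))
    return out

def attention_pool(feats, query):
    # Graph readout: softmax-weighted sum of node features, scored
    # against a query vector (hypothetical; it would normally be learned).
    weights = softmax([dot(query, f) for f in feats])
    return weighted_sum(weights, feats)

# Toy graph: node 0 is linked to nodes 1 and 2.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

mean_out = mean_aggregate(adj, feats)
att_out = attention_aggregate(adj, feats)
graph_vec = attention_pool(att_out, query=[1.0, 0.0])
```

In this toy graph, node 0 has one neighbor with the same features and one with different features. Equal-weight averaging mixes both equally, while the attention variant gives the similar neighbor a larger share. The pooled vector `graph_vec` is what a classifier such as a multi-layer perceptron would then take as input.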
Keywords/Search Tags:Natural Language Processing, Text Matching, Graph Convolutional Neural Network, Graph Attention Network, Graph Pooling