Research On Formula Retrieval Model Based On N-ary Tree Structure And Word Embedding

Posted on: 2022-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Dai
Full Text: PDF
GTID: 2518306479993229
Subject: Software engineering
Abstract/Summary:
Mathematical formula retrieval is an important research topic in information retrieval, since its retrieval objects are mainly mathematical formulas with two-dimensional structural characteristics. Traditional text retrieval methods struggle to capture the structural information of formulas, resulting in low retrieval accuracy and efficiency, and so cannot adequately meet the formula retrieval needs of various professional fields. How to capture the features of specially structured data such as mathematical formulas and apply them in information retrieval systems has therefore become an urgent problem. By drawing on word embedding techniques from natural language processing, this thesis proposes a formula embedding model and builds one pure formula retrieval model and two new hybrid retrieval models to achieve efficient and accurate retrieval of mathematical formulas. The main work of this thesis is as follows.

(1) Based on the special structural characteristics of mathematical formulas, this thesis proposes a formula embedding model called NTFEM. First, mathematical formulas in MathML format are extracted from the documents and converted into an N-ary formula tree structure, which captures the two-dimensional structural information of the formulas. Second, a tokenization method for formulas is defined: each formula is converted from its two-dimensional structure into a new one-dimensional linear sequence, and the substructure vectors of the formulas are learned with word embedding models. Finally, based on an analysis of the importance of each node in the formula trees, a weighting algorithm for the substructures of mathematical formulas is proposed to further improve the quality of the formula embeddings, and a similarity matching mechanism for formulas is established. According to the results on the NTCIR-12 Wikipedia formula retrieval task, the pure formula retrieval model NTFEM outperforms traditional formula search engines. NTFEM not only improves retrieval efficiency but also greatly reduces training time.

(2) Taking the text around the formulas into account, this thesis proposes two hybrid retrieval models, NTFEM-T and NTFEM-K, aimed at two retrieval scenarios: formulas combined with long text and formulas combined with keywords. The NTFEM-T model first extracts the formulas and the text separately from the documents and then learns the features of the text around the formulas through a word embedding model. The NTFEM-K model instead obtains the features of keywords in the surrounding text through keyword extraction and embedding techniques. Both models embed the extracted formulas through NTFEM. Finally, the features of the text and of the formulas are combined to build the two hybrid retrieval models. Compared with the pure formula retrieval model, the hybrid retrieval models achieve better results on the Topic-eq dataset, which indicates that combining the text around the formulas can effectively supplement the semantic features missing from the formula structure and further improve the performance of the retrieval model to meet more diverse retrieval needs.

The mathematical formula retrieval models proposed in this thesis apply word embedding techniques from natural language processing to the special two-dimensional structural information of mathematical formulas. The experimental results show that the proposed models achieve better results on the formula retrieval task than traditional formula retrieval systems. In addition, combining the text around the formulas supplements the semantic features of the formulas and further improves the accuracy and efficiency of formula retrieval.
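The pure formula pipeline described in the abstract (MathML to N-ary tree, tree to linear token sequence, weighted combination of token vectors, similarity matching) can be illustrated with a minimal Python sketch. The abstract does not specify the tokenization rules, the embedding model, or the weighting algorithm, so everything here is an assumption made for illustration: the depth-first tokenization, the hash-based `toy_vector` that stands in for a trained word embedding, the `1 / (1 + depth)` node weight, and all function names. It is not the NTFEM implementation.

```python
import hashlib
import math
import xml.etree.ElementTree as ET

DIM = 16  # toy embedding size (assumption, not from the thesis)


def linearize(mathml: str):
    """Parse MathML and walk the tree depth-first, returning (token, depth) pairs."""
    root = ET.fromstring(mathml)
    out = []

    def walk(node, depth):
        text = (node.text or "").strip()
        token = f"{node.tag}:{text}" if text else node.tag
        out.append((token, depth))
        for child in node:  # N-ary: a node may have any number of children
            walk(child, depth + 1)

    walk(root, 0)
    return out


def toy_vector(token: str):
    """Stand-in for a trained embedding: a deterministic vector derived from a hash."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def embed(mathml: str):
    """Weighted average of token vectors; 1 / (1 + depth) is an assumed weighting."""
    vec = [0.0] * DIM
    total = 0.0
    for token, depth in linearize(mathml):
        weight = 1.0 / (1 + depth)
        total += weight
        for i, x in enumerate(toy_vector(token)):
            vec[i] += weight * x
    return [x / total for x in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


X_SQUARED = "<math><msup><mi>x</mi><mn>2</mn></msup></math>"
X_CUBED = "<math><msup><mi>x</mi><mn>3</mn></msup></math>"

if __name__ == "__main__":
    print(linearize(X_SQUARED))
    print(cosine(embed(X_SQUARED), embed(X_CUBED)))
```

In a real system the hash-based vectors would be replaced by vectors learned from a corpus of linearized formulas, and the node weights by whatever importance analysis the thesis defines; the sketch only shows how the stages connect.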
Keywords/Search Tags:Mathematical Formula Retrieval, Formula Embedding, Word Embedding, Mathematical Formula Similarity