Similarity Computing Of Scientific And Technical Documents Based On Texts And Formulas

Posted on:2020-03-18

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Xu

Full Text:PDF

GTID:2428330596485247

Subject:Computer Science and Technology

Abstract/Summary:

Commonly used document similarity methods include the set model, the vector space model, and Latent Semantic Analysis. These methods use only text information to calculate document similarity. However, scientific and technical documents contain a large amount of non-text information, such as formulas, graphs, and tables, which makes these methods unsuitable. This thesis proposes a method for calculating the similarity of scientific and technical documents based on text and formulas. The method considers the text and formula information in the documents: given the text similarity and the formula similarity between two documents, a linear combination yields the overall document similarity. The KNN classification algorithm is used to compare the classification performance of the text-and-formula method with that of the vector space model method. Experimental results on the MREC dataset show that the text-and-formula method increases the macro-average F1-score (MF) by up to 3%. Incorporating formula information not only improves the accuracy of the similarity calculation for scientific and technical documents but also enables similarity calculation for cross-language scientific and technical documents.

The main research results of this thesis are as follows.

Many methods exist for calculating formula similarity. Under the precondition that formula variables are not considered, two methods for calculating formula similarity are proposed. To address the problem that text-based methods ignore the ordering of a formula's feature elements, a formula similarity method based on feature serialization is proposed. The method extracts the operators, constants, and parentheses of a formula as its feature elements, maps the positions of these feature elements to a position vector, and obtains the formula similarity by checking whether the position vectors are equal. To address the problem of invalid subtree matching in the hybrid method, a formula similarity method based on valid matching subtrees is proposed. First, the multiset of valid subtrees is obtained by preorder traversal. Then the method finds all valid matches by using information about the exchange of the first child node of the valid subtree's parent node and about whether the valid subtree has already been matched. Finally, the method considers how the number of nodes in a valid matching subtree and its level in the parse tree influence its weight, and gives a weight calculation for valid matching subtrees to obtain the formula similarity. Experiments verify the effectiveness of the valid matching subtree method.

A method for calculating the formula similarity between documents is proposed based on the KM algorithm, to ensure one-to-one matching of formulas and a reasonable quantification of the formula similarity between documents. The method uses the pairwise formula similarities to construct a weighted bipartite graph, uses the KM algorithm to find the maximum weight matching of that graph, and calculates the formula similarity between documents from the maximum weight matching and the number of formulas in the documents.
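
The document-level steps described in the abstract (one-to-one formula matching, then a linear combination with the text similarity) can be illustrated with a minimal Python sketch. Several details are assumptions, since the abstract does not give them: the normalization by the larger formula count, the weighting parameter `alpha`, the function names, and the sample similarity values are all hypothetical. A brute-force search over assignments stands in for the KM algorithm here; it finds the same maximum weight matching but is only practical for a handful of formulas.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Return the largest total weight of a one-to-one formula matching.

    weights[i][j] is the similarity between formula i of document A and
    formula j of document B. Brute force replaces the KM algorithm here.
    """
    if not weights or not weights[0]:
        return 0.0
    rows, cols = len(weights), len(weights[0])
    # Transpose if needed so the shorter side is assigned to the longer one.
    if rows > cols:
        weights = [list(col) for col in zip(*weights)]
        rows, cols = cols, rows
    best = 0.0
    for assignment in permutations(range(cols), rows):
        total = sum(weights[i][j] for i, j in enumerate(assignment))
        best = max(best, total)
    return best


def formula_similarity_between_documents(weights):
    """Divide the matching weight by the larger formula count (assumed)."""
    if not weights or not weights[0]:
        return 0.0
    larger = max(len(weights), len(weights[0]))
    return max_weight_matching(weights) / larger


def document_similarity(text_sim, formula_sim, alpha=0.5):
    """Linear combination; alpha is a hypothetical weighting parameter."""
    return alpha * text_sim + (1.0 - alpha) * formula_sim


# Hypothetical pairwise formula similarities: 2 formulas in A, 3 in B.
pairwise = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
]
formula_sim = formula_similarity_between_documents(pairwise)
combined = document_similarity(text_sim=0.6, formula_sim=formula_sim, alpha=0.5)
print(round(formula_sim, 4), round(combined, 4))
```

Dividing by the larger formula count penalizes documents whose formula sets differ in size, which is one plausible reading of "the number of formulas"; the thesis itself may normalize differently.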

Keywords/Search Tags:

Scientific and technical document, Document similarity, Formula similarity, Feature serialization, Valid matching subtree, Formula between documents

Related items

1	An English Scientific Document Retrieval Method Based On Formula Description Structure And Word Embedding
2	Extraction Of Mathematics Formulas In Chinese Scientific Document
3	Research On Semantic Similarity Computation And Applications
4	Mathematical Formula Extraction In Printed-Chinese Documents Based On EEN Feature Function
5	Mathematical Formula Feature Extraction And Locating In Chinese Scanned Printed Document
6	The Improved Algorithm For Identifying Mathematical Formulas In The Images Of PDF Documents
7	Research On The Mathematical Formula Recognition Technology For Printed Document
8	A Retrieval Model Of Scientific Documents Based On Mathematical Expression Features
9	Extraction, Recognition And Reconstruction Of Mathematics Formulas In English Scientific Document
10	The Extraction Of Mathematical Formulas In Word Documents For Math Retrieval