In many big data applications, Chinese full names have corresponding abbreviations. Because data collection standards and sources vary, the same entity is often referred to by both its full name and its abbreviation. To improve data quality when cleaning and integrating data from multiple sources, it is important to calculate the similarity score between the "abbreviation" and "full name" strings of fixed Chinese nouns. However, developing full-name/abbreviation matching technology faces several challenges. Traditional methods often consider only abbreviations and ignore homophonic and homographic words. The efficiency of similarity matching between full names and abbreviations is negatively correlated with abbreviation complexity, so efficiency becomes harder to achieve as abbreviation structures grow more complex. Additionally, the unique grammatical structure of Chinese makes it even more difficult to construct appropriate abbreviations for entity names. Therefore, accurately and quickly retrieving the corresponding full entity name when an abbreviation is used for information retrieval is of great importance and practical value. To address the near-homophonic and near-homographic problems caused by homophonic errors in input methods and speech recognition, as well as homographic errors in handwriting input methods, this study proposes integrating Chinese homophonic full-name/abbreviation matching models with speech and image models to solve the problem of matching full entity names with their abbreviations. Two Chinese full-abbreviation matching models and a method that fuses them are designed: (1) A matching method for full names and abbreviations of Chinese near-phonetic characters based on SimBERT and VGG: the strong prior speech information of a pre-trained VGG and CTC speech recognition model is used to improve the SimBERT-based full-abbreviation matching model. Thus, speech knowledge
is fused with neural networks to achieve a near-homophonic full-abbreviation matching model. This method effectively integrates speech information into deep neural networks, improving the matching accuracy for near-homophonic full names and abbreviations; the matching accuracy of this model was 75.2%. (2) A matching method for full names and abbreviations of Chinese near-glyph characters based on a multi-information-fusion DenseNet: to address near-homographic full-abbreviation matching, Chinese characters are first split into Wubi codes, character decomposition, and font. A statistical feature extractor is then used to extract the features of the Wubi codes, character decomposition, and font separately. A model that integrates character shape features is proposed, and the impact of character shape and image features on the model's performance is explored. This approach achieves near-homographic full-abbreviation matching and improves the accuracy of the model; the matching accuracy of this model was 80.9%. (3) A full-name and abbreviation matching method based on the fusion of character pronunciation and character glyph information: the near-phonetic and near-glyph matching models are coupled so that they work together in the recognition stage of full-abbreviation matching. Expert knowledge is used as a component of the feature optimizer, and its weight is obtained through self-supervised learning to express the importance of expert knowledge for matching entity full names and abbreviations. Candidate full names and abbreviations are ranked by similarity score, and the pair with the highest similarity score is considered the best match. The matching accuracy for full names and abbreviations reaches 87.5%.
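The final ranking step in (3) can be illustrated with a minimal sketch, which is not the models described above. It assumes that each of the two models yields a similarity score in [0, 1] for an (abbreviation, full name) pair and that the two scores are combined by a simple weighted sum. The names `subsequence_score`, `fuse_scores`, `rank_full_names`, and the fixed weight `w_phonetic` are hypothetical; the toy subsequence scorer merely stands in for the SimBERT/VGG and DenseNet models, and in the study the weighting is learned rather than fixed.

```python
from typing import Callable, List, Tuple

# A scorer maps (abbreviation, full_name) to a similarity in [0, 1].
Scorer = Callable[[str, str], float]


def subsequence_score(abbr: str, full: str) -> float:
    """Toy stand-in scorer (assumption): the fraction of abbreviation
    characters that can be found, in order, inside the full name."""
    if not abbr:
        return 0.0
    pos = 0
    matched = 0
    for ch in abbr:
        idx = full.find(ch, pos)
        if idx != -1:
            matched += 1
            pos = idx + 1
    return matched / len(abbr)


def fuse_scores(phonetic: float, glyph: float, w_phonetic: float = 0.5) -> float:
    """Weighted sum of the two similarity scores.
    The fixed weight is a hypothetical placeholder for a learned one."""
    return w_phonetic * phonetic + (1.0 - w_phonetic) * glyph


def rank_full_names(
    abbr: str,
    candidates: List[str],
    phonetic_scorer: Scorer,
    glyph_scorer: Scorer,
    w_phonetic: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score every candidate full name and sort by fused score, best first."""
    scored = [
        (full, fuse_scores(phonetic_scorer(abbr, full),
                           glyph_scorer(abbr, full),
                           w_phonetic))
        for full in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    names = ["清华大学", "北京大学", "复旦大学"]
    # Both scorers are the same toy function here, purely for illustration.
    for full, score in rank_full_names("北大", names,
                                       subsequence_score, subsequence_score):
        print(full, round(score, 3))
```

With real models, the two scorer arguments would be replaced by calls into the near-phonetic and near-glyph networks; the ranking logic itself would stay the same.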