| Effectively extracting information from the massive volume of Internet media is of great significance to financial market investors and financial researchers. This study focuses on a foundational task: semantic matching between Internet media documents and stocks. A systematic review of related work shows that dedicated research in this area is neither deep nor complete, and that existing methods extract semantic information only shallowly. Building on existing theories and techniques from information science, this study therefore explores a method for deep semantic matching between Internet media documents and stocks, with the aim of providing a solid foundation for financial market investors and researchers in the field. The study formulates semantic matching between Internet media documents and stocks as an extreme multi-label classification problem. Using Transformer pre-trained language models, a state-of-the-art development in NLP, as the main technique, the experiments follow the three-stage "Indexing-Matching-Ranking" X-Transformer model. In the Indexing stage, stocks are grouped into 10 asset categories using the descriptive information of each stock, which yields a mapping from stocks to asset categories. In the Matching stage, starting from literal matching results, co-occurrence relationships between individual stocks are introduced as a novel external rule for data annotation, which greatly reduces the time and labor cost of annotation; then, based on the mapping obtained in the Indexing stage, a Transformer pre-trained language model is trained as a multi-label classifier from media documents to asset categories. In the Ranking stage, given the asset categories matched to a media document, a Liblinear classifier is used to find the best-matching stocks. The experimental results are evaluated by accuracy, recall, and F1, and the model performs well. Combining the literal and semantic matching results further improves these metrics. Two additional experiments are carried out: a comparison with multi-label classification using a Transformer model directly, and a semantic-level validation that measures the correlation between the stock correlation matrix derived from semantic matching and the stock correlation matrix derived from the mutual information of log returns. The results strongly support the effectiveness of the X-Transformer model for semantic information extraction and extreme multi-label classification. This research offers an effective supplement and reference method for studies on matching Internet media documents with stocks, and provides an important basis for the decision-making of financial investors. |
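
To make the three-stage "Indexing-Matching-Ranking" flow concrete, the following is a minimal sketch in Python and not the implementation used in this study. It assumes toy data throughout: the tickers, categories, and keyword sets are hypothetical. Simple keyword overlap stands in for both the Transformer-based category classifier (Matching) and the Liblinear classifier (Ranking). Only the control flow is meant to mirror the description above: stocks are indexed by category, a document is matched to categories, and stocks are ranked only within the matched categories.

```python
from collections import defaultdict

# Hypothetical toy data: stock -> asset category (the output of "Indexing").
STOCK_TO_CATEGORY = {
    "BANK_A": "finance",
    "BANK_B": "finance",
    "CHIP_A": "technology",
    "CHIP_B": "technology",
}

# Hypothetical keyword profiles; these stand in for trained models.
CATEGORY_KEYWORDS = {
    "finance": {"loan", "interest", "deposit"},
    "technology": {"chip", "semiconductor", "wafer"},
}
STOCK_KEYWORDS = {
    "BANK_A": {"loan", "mortgage"},
    "BANK_B": {"deposit", "interest"},
    "CHIP_A": {"chip", "wafer"},
    "CHIP_B": {"semiconductor", "display"},
}


def build_index(stock_to_category):
    """Indexing: invert the stock -> category mapping into category -> stocks."""
    index = defaultdict(list)
    for stock, category in stock_to_category.items():
        index[category].append(stock)
    return dict(index)


def match_categories(tokens, category_keywords, threshold=1):
    """Matching: keep every category sharing at least `threshold` keywords
    with the document (a stand-in for the multi-label classifier)."""
    return [
        category
        for category, keywords in category_keywords.items()
        if len(keywords & tokens) >= threshold
    ]


def rank_stocks(tokens, categories, index, stock_keywords, top_k=2):
    """Ranking: score only the stocks inside the matched categories
    (a stand-in for the per-category linear ranker)."""
    scored = []
    for category in categories:
        for stock in index[category]:
            score = len(stock_keywords[stock] & tokens)
            if score > 0:
                scored.append((score, stock))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [stock for _, stock in scored[:top_k]]


def match_document(text, top_k=2):
    """Run the three stages on one whitespace-tokenized document."""
    tokens = set(text.lower().split())
    index = build_index(STOCK_TO_CATEGORY)
    categories = match_categories(tokens, CATEGORY_KEYWORDS)
    return rank_stocks(tokens, categories, index, STOCK_KEYWORDS, top_k)
```

The point of the intermediate category step is that the ranker scores only the stocks in the matched categories, not every stock. This is what keeps an extreme multi-label problem tractable as the label set grows.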