| As AI technology develops, more and more jobs will be transformed by it. Text generation is one example. In the past, reporters, editors, journalists and other professionals were needed to write up sports reports and news stories for the public. AI text generation techniques are now replacing this process. An important branch of text generation, and one of its key research topics, is data-to-text generation, where the goal is to automatically generate relevant descriptive text from structured input data. This task poses two challenges: how to select the important information from redundant structured data (the content planning phase) and how to describe that information correctly in natural language (the surface realisation phase). Previous work has identified the content planning phase as the main bottleneck. This thesis builds on previous data-to-text generation models and adds a numerical relationship mining module to address some weaknesses in the content planning phase. Finally, the MASS (Masked Sequence to Sequence Pre-training for Language Generation) module is introduced. The numerical relationship mining module calculates the intrinsic links between non-adjacent pairs of values in the structured data, which enriches the textual description of the data and reduces the amount of relevant data that is left out. The MASS module regenerates the text output by the model in order to diversify its wording. The main work of this thesis is as follows: 1) Pre-processing and vectorisation of data. This thesis uses an information extraction technique to extract suitable content from match summaries for planning. The technique can identify candidate pairs of entities (players, teams, cities) and values (points, rebounds, etc.) in the text and then
predict the type and relationship of each candidate pair. The information extraction model is designed to predict relationships by combining three convolutional models and three bidirectional LSTMs (Long Short-Term Memory networks). The output of this pre-processing system is a tuple of the form (entity, value, record type, H/V). Player names are pre-processed to indicate individual names, and team records are pre-processed to indicate the name of the team’s city and the team itself. The pre-processed data is vectorised through a fully connected layer and ReLU (Rectified Linear Unit). 2) Content selection and planning. Once the vectorised data is obtained, the system selects features from it by means of a content selection gate: the vectorised data is weighted through an attention mechanism to obtain a feature matrix, and the gate then decides which features are kept. A data relationship mining module then re-evaluates the relationship value of the data pairs; if the relationship value is high, features that were removed are added back, giving the final feature matrix to be retained. At each step, a decoder built from a pointer network and an LSTM takes the hidden state of the previous output and the attention result as input and produces the output of the current step. The data relationship mining module is also used at this stage to evaluate the value of relationships between data. 3) Text generation and diverse representation. Given the content planning output from the previous step, a Bi-LSTM network with a copy mechanism decodes, predicts and generates the text. The output of the model is then fed into MASS for training, resulting in a more diverse representation of the text. |
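
The content selection step described in 2) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the thesis's actual model: the abstract gives no formulas, so the scalar gate, the thresholds `keep_at` and `relate_at`, and the function names are all hypothetical. In particular, `relation_score` is a stand-in that treats numerically close values as strongly related. The sketch scores each record against an attention context built from the other records, keeps the records whose gate is high, and then adds back dropped records that relate strongly to a kept one.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gate_scores(vectors):
    """Scalar gate per record (assumed form): sigmoid of the record's dot
    product with its attention context. Needs at least two records."""
    scores = []
    for i, r in enumerate(vectors):
        others = [v for j, v in enumerate(vectors) if j != i]
        # Attention weights of record i over every other record.
        weights = softmax([dot(r, o) for o in others])
        # Context vector: attention-weighted sum of the other records.
        context = [sum(w * o[k] for w, o in zip(weights, others))
                   for k in range(len(r))]
        scores.append(1.0 / (1.0 + math.exp(-dot(r, context))))
    return scores

def relation_score(a, b):
    """Hypothetical stand-in for the relationship mining module:
    numeric values that are close to each other score near 1."""
    return 1.0 / (1.0 + abs(a - b))

def select_with_relations(vectors, values, keep_at=0.5, relate_at=0.5):
    """Return (indices kept by the gate, dropped indices added back)."""
    gates = gate_scores(vectors)
    kept = {i for i, g in enumerate(gates) if g >= keep_at}
    dropped = [i for i in range(len(vectors)) if i not in kept]
    # Re-admit a dropped record if it relates strongly to any kept record.
    readded = {i for i in dropped
               if any(relation_score(values[i], values[j]) >= relate_at
                      for j in kept)}
    return sorted(kept), sorted(readded)

# Toy input: three record vectors and the numeric value of each record.
vectors = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
values = [30, 12, 29]
print(select_with_relations(vectors, values))  # ([0, 1], [2])
```

In this toy input the gate drops the third record, but its value (29) is close to a kept value (30), so the relation check adds it back. A trained model would learn both the gate and the relation scorer rather than use fixed formulas.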