| With the development of Web 2.0 technology, social media has come to have an important impact on many domains. Mining the knowledge buried in the large volume of user-generated content (UGC) on social media to support decision making has attracted significant interest from both the research and industry communities. Relation extraction, a subtask of information extraction, aims to extract structured information from unstructured text by recognizing predefined relationships between two entities. It plays an important role in applications such as ontology construction, question answering, information retrieval, and text summarization. However, the unique characteristics of UGC may degrade relation extraction performance. It is therefore necessary to propose improved methods that take these characteristics into account, both to improve the information quality of the generated structured data and to provide reliable, real-time information about that quality. The objective of this paper is to improve the information quality of generated structured data. Specifically, this paper uses tools and methods from natural language processing and probability theory to study methods for extracting relations from social media and the quality of the structured data derived from relation extraction models. Improved relation extraction methods are developed. A framework that integrates relation extraction from social media with information quality is proposed. Moreover, a dynamic measurement model for information quality is developed. Additionally, a case study on extracting adverse drug events (ADEs) from social media, which can be formalized as a relation extraction problem, is conducted to verify the effectiveness of the proposed relation extraction methods. The main research contents and original contributions of this paper are as follows: (1) A framework for managing social-media-based relation extraction and the quality of its derived 
structured data is developed. The paper analyzes the relevant theory of information quality, including information quality dimensions and the total data quality management (TDQM) framework. Moreover, common types of social media platforms are summarized, the characteristics of UGC on social media are analyzed, and the pipeline of relation extraction from social media is reviewed. This paper develops a framework comprising four levels, i.e., the data type level, quality concept level, quality measurement level, and quality improvement level, which extends the objects of information quality research and achieves full information quality management of each relation extraction model together with its input and output. (2) A series of relation extraction methods fusing co-training and ensemble learning are proposed. First, feature vectors are constructed for ADE extraction from social media, including feature extraction based on existing studies and feature selection using the information gain method. Additionally, statistics on three test beds, as well as the composition and frequency distribution of the selected features, are discussed in detail. This paper then studies semi-ensemble learning, which builds on semi-supervised learning and ensemble learning, and proposes a series of Co-ensemble methods, including Co-Bagging, Co-Boosting, and Co-RS, by fusing co-training with ensemble learning methods. The effectiveness of the Co-ensemble methods is verified on ADE extraction from social media. Co-ensemble methods address the limited diversity among classifiers in existing single-view semi-ensemble learning methods. (3) A novel kernel called POS-SSDP (Part-of-Speech and lexical Semantic similarity based Shortest Dependency Path Kernel), which incorporates lexical semantic similarity and part-of-speech analysis, is proposed. The paper analyzes several well-known existing kernels, including the tree kernel, subset tree kernel, shortest dependency path kernel, and all-paths graph kernel. Advantages and 
disadvantages of these kernels, as well as of their derivatives, are summarized based on existing studies. Moreover, a kernel ensemble framework, which combines different kernels using different combination methods, is developed. Additionally, this paper proposes POS-SSDP by fusing the shortest dependency path, lexical semantic similarity, and part-of-speech analysis. The effectiveness of POS-SSDP is verified on ADE extraction from social media. POS-SSDP can effectively address the problem of expression diversity on social media and the problem that deep parsers are generally not robust when dealing with UGC. (4) A novel dynamic measurement model for information quality is established. Based on an analysis of how data are organized in relational databases and of common relational algebra operations, the paper analyzes the static propagation model for assessing the quality of information views derived through relational algebra. A dynamic model is established by incorporating a time variable. Specifically, this paper first constructs a timeliness matrix based on data item timeliness and defines methods for measuring attribute timeliness and tuple timeliness. Subsequently, the “Selection” operation of relational algebra is used to construct the mapping between out-of-date tuples at a given time in the original quality groups and the derived new quality groups. In this paper, this mapping covers only the specific case in which the selection condition is applied to a nonidentifier attribute and the out-of-date data item belongs to a nonidentifier attribute. Lastly, the occurrence probability of each mapping condition is computed using probability theory, thereby measuring the information quality at that time. Customer information is used as an example to illustrate how the developed dynamic model works and to verify its feasibility. The proposed dynamic model extends the existing static propagation model, achieves dynamic information quality management, and can 
provide quantitative decision support for decision makers. |
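
Contribution (3) above builds on the shortest dependency path kernel. As a rough illustration of that baseline only, the sketch below scores two dependency paths by multiplying, position by position, the number of features they share, and returns zero when the paths differ in length. This is not the paper's POS-SSDP: the feature sets, the toy paths, and the name `sdp_kernel` are assumptions made for the example, and the lexical semantic similarity component is not implemented.

```python
from math import prod
from typing import Sequence, Set

# Each position on a path is a set of features. The features used here
# (word, POS tag, coarse word class) are hypothetical, for illustration.
Path = Sequence[Set[str]]


def sdp_kernel(x: Path, y: Path) -> int:
    """Baseline shortest-dependency-path kernel (exact feature matching).

    Returns 0 when the paths differ in length; otherwise the product,
    over aligned positions, of the number of shared features.
    """
    if len(x) != len(y):
        return 0
    return prod(len(a & b) for a, b in zip(x, y))


# Toy paths between a drug mention and a symptom mention (made-up data).
path_a = [{"DRUG", "NN"}, {"->"}, {"caused", "VBD", "Verb"},
          {"<-"}, {"headache", "NN", "Noun"}]
path_b = [{"DRUG", "NN"}, {"->"}, {"gave", "VBD", "Verb"},
          {"<-"}, {"nausea", "NN", "Noun"}]

similarity = sdp_kernel(path_a, path_b)  # shared features: 2 * 1 * 2 * 1 * 2
```

With exact matching, "caused" and "gave" contribute only through their shared POS tag and word class, and "headache" and "nausea" likewise. That limitation is one plausible reading of why the abstract adds lexical semantic similarity to handle expression diversity in UGC.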