High-quality Extraction From Web Data By External Resources

Posted on: 2013-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Du
GTID: 2218330374967233
Subject: Computer software and theory
Abstract/Summary:
With the development of Web 2.0, user-generated content (UGC) makes up a growing share of the World Wide Web (WWW), on platforms such as Weibo, online forums, and YouTube. UGC differs from traditional Web data because it is published freely by users. Extracting semantic entities from UGC supports further research such as meme detection and personalized recommendation. However, the data quality of UGC is quite low, so traditional information retrieval techniques often fail on it. The main idea of this paper is to use external resources to refine the extraction. Starting from the low-quality UGC, we apply traditional techniques such as single-pass clustering or SVM to group the semantic entities, and then take advantage of external resources such as Wikipedia and Amazon to refine the initial clusters. Because the data in these external resources is well formatted, the refined results are high-quality semantic entities.

The main contributions of this paper are:
1. We propose a new approach for extracting high-quality semantic entities from low-quality UGC by using external resources;
2. We analyze how to retrieve external resources efficiently;
3. The refined high-quality semantic entities can be used as training data to build the model;
4. We report experimental results on two real datasets, "liba" and "Sina Weibo".

This paper proposes an approach to the problem of extracting high-quality semantic entities from a low-quality UGC dataset, and the experimental results indicate that the approach works.
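The two-stage idea in the abstract (cluster noisy posts first, then refine the clusters against a cleaner external source) can be illustrated with a minimal sketch. This is not the thesis's implementation: the bag-of-words cosine similarity, the `threshold` values, the function names, and the use of a plain list of titles as a stand-in for an external resource such as Wikipedia or Amazon are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(posts, threshold=0.3):
    # Single-pass clustering: visit each post once, attach it to the most
    # similar existing cluster if the similarity reaches the threshold,
    # otherwise start a new cluster.
    clusters = []
    for post in posts:
        vec = Counter(tokenize(post))
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(post)
            best["centroid"].update(vec)  # centroid = summed term counts
        else:
            clusters.append({"centroid": vec, "members": [post]})
    return clusters


def refine_with_external(clusters, external_titles, threshold=0.3):
    # Hypothetical refinement step: map each cluster to the best-matching
    # title from a clean external list; clusters with no good match are
    # dropped, and clusters matching the same title are merged.
    refined = {}
    for cluster in clusters:
        best_title, best_sim = None, 0.0
        for title in external_titles:
            sim = cosine(cluster["centroid"], Counter(tokenize(title)))
            if sim > best_sim:
                best_title, best_sim = title, sim
        if best_title is not None and best_sim >= threshold:
            refined.setdefault(best_title, []).extend(cluster["members"])
    return refined
```

Calling `single_pass_cluster` on a list of post strings and passing the result to `refine_with_external` with a list of known entity names yields a dictionary from entity name to the posts assigned to it. In practice the external lookup would be a query against a real resource rather than an in-memory list.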
Keywords/Search Tags: Machine learning, Classification, Imbalanced data set, SVM