High-quality Extraction From Web Data By External Resources

Posted on:2013-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:J Du

GTID:2218330374967233

Subject:Computer software and theory

Abstract/Summary:

With the development of Web 2.0, user-generated content (UGC) on the World Wide Web (WWW), such as "Weibo", "online forum" and "YouTube", keeps growing. UGC differs from traditional Web data because it is published freely by users. Extracting semantic entities from UGC supports further research such as meme detection and personalized recommendation. However, the data quality of UGC is quite low, and therefore traditional information retrieval techniques fail on it. The main idea of this paper is to use external resources for refinement. Starting from the low-quality UGC, we use traditional techniques such as single-pass or SVM to cluster the semantic entities, and then take advantage of external resources such as Wikipedia and Amazon to refine the initial clustering results. Since the data in external resources is formally structured, the refinement step yields high-quality semantic entities. The main contributions of this paper include:

1. We propose a new approach to extract high-quality semantic entities from low-quality UGC by utilizing external resources;
2. We analyze how to retrieve external resources efficiently;
3. The refined high-quality semantic entities can be used as training datasets to build a model;
4. We report experimental results on two real datasets, "liba" and "Sina Weibo".

This paper proposes an approach to the problem of extracting high-quality semantic entities from a low-quality UGC dataset, and the experimental results show that it works.
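
The two-stage idea in the abstract (cluster the noisy UGC first, then refine against an external resource) can be illustrated with a small Python sketch. It is not the thesis's implementation: the abstract does not specify the similarity measure, the threshold, or how external resources are matched. Here bag-of-words cosine similarity, a threshold of 0.3, and the helper names `single_pass` and `refine` are all assumptions, and a plain list of titles stands in for entries retrieved from Wikipedia or Amazon.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real UGC needs more cleaning).
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.3):
    # Single-pass clustering: each document is seen once and either joins
    # the most similar existing cluster or starts a new one.
    clusters = []
    for i, doc in enumerate(docs):
        vec = Counter(tokenize(doc))
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


def refine(cluster, external_titles):
    # Hypothetical refinement step: label the cluster with the closest
    # external title, or None when nothing overlaps.
    best_title, best_sim = None, 0.0
    for title in external_titles:
        sim = cosine(cluster["centroid"], Counter(tokenize(title)))
        if sim > best_sim:
            best_title, best_sim = title, sim
    return best_title
```

For example, given the posts "iphone 5 battery problem", "battery life iphone 5 bad", "harry potter book review" and "review of harry potter book", `single_pass` groups the first two and the last two, and `refine` with the titles "iPhone 5" and "Harry Potter" assigns each cluster its clean entity name.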

Keywords/Search Tags:

Machine learning, Classification, Imbalanced data set, SVM

Related items

1	Research On Weighted Extreme Learning Machine Algorithm Based On Imbalanced Data Distribution
2	Research On Classification Algorithms For Imbalanced Dataset
3	An Automatically Filter Algorithm For Imbalanced Data Sets Classification
4	Research On Imbalanced Data Augmentation And Imbalanced Classification Based On Auto-Encoder
5	Research On Imbalanced Data Classification Algorithm Based On Extreme Learning Machine
6	Research On Classification Methods Based On Extreme Learning Machine
7	Research On Extreme Learning Machine For Online Sequential Imbalanced Data Classification
8	Imbalanced Data Classification Algorithm Based On Unsupervised Intelligent Under Sampling Method
9	Research On The Classification Of Imbalanced Data Sets Based On R-SMOTE
10	Research On The Imbalanced Data Learning