
Knowledge Mining Based On Statistical Snowball Models

Posted on: 2012-04-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X J Liu
Full Text: PDF
GTID: 1118330335962512
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of Internet technologies, the World Wide Web has grown into a huge knowledge repository containing many kinds of valuable information about real-world named entities. These named entities include organizations, locations, and persons, ranging from celebrities to everyday individuals. Named entity search engines automatically mine named entities from Web pages and summarize knowledge about them based on their Web appearances, so that the summaries can be returned directly to users. Compared with general search engines, which can only return unstructured Web pages, this type of search engine provides a faster and more direct user experience, and it has become an active area of research and development in both industry and academia.

Building a fast and accurate named entity search engine requires deep knowledge mining on named entities from the Web. There are three key knowledge mining problems in building named entity search engines: named entity recognition, named entity summarization, and named entity relationship mining. Focusing on these three problems, this dissertation proposes a statistical unsupervised learning framework named StatSnowball, which overcomes the disadvantages of state-of-the-art unsupervised learning models. The main contents and contributions of this dissertation are as follows:

1. Discuss the state-of-the-art Web-scale knowledge mining systems. The discussion focuses on supervised methods based on natural language features and on state-of-the-art self-supervised methods based on extraction patterns. These methods have been widely used in different knowledge mining tasks. The analysis emphasizes the basic idea behind these two types of methods and their typical models.

2. Propose an unsupervised learning model for relationship extraction: StatSnowball (Statistical Snowball). The model adopts the bootstrapping framework and uses a general statistical model, Markov logic networks (MLN), as the underlying extraction model. By using statistical pattern evaluation and selection methods, StatSnowball can incorporate all kinds of patterns. By adopting MLN, StatSnowball accomplishes various levels of joint inference in relationship extraction. Experiments on both small, fully labeled data and large-scale Web data show the effectiveness of the method.

3. Propose EntSum, a unified model for named entity recognition and relation extraction based on an iterative framework. The model extends the conditional random field model used for named entity recognition so that relationship features can be added to it. The joint model adopts an iterative framework to build a bidirectional connection between the two tasks, so that the results of each task can be used in the other's decision making. Experiments on real Web data show improved performance on both tasks.

4. Propose an entity summarization model, BioSnowball, which can be considered an extension of the basic StatSnowball model. By using the Fact-Bio duality, BioSnowball adopts the bootstrapping framework and starts from only a small set of samples to jointly complete two different types of summarization: fact extraction and biography ranking for Web entities. Experiments on real Web data and a user study show the effectiveness of the model on both problems. The success of BioSnowball also shows the generality of the basic StatSnowball model.

5. Build two publicly available named entity search engines, Renlifang and EntityCube, for which the author served as the main researcher and developer. These two search engines automatically mine knowledge from billions of Chinese and English Web pages, respectively, and build an entry page for every extracted entity. StatSnowball has already been applied in these systems, and the other methods in this dissertation have also been verified on data from these two real systems.

The dissertation concludes with a summary and a discussion of directions for future work.
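
As a rough illustration of the bootstrapping framework mentioned in item 2, the following Python sketch shows a generic Snowball-style loop: start from seed entity pairs, score the textual patterns that connect them, keep the reliable patterns, and use those to extract new pairs. This is not the StatSnowball implementation. In particular, the simple precision score below stands in for the statistical pattern evaluation and the Markov logic network described in the abstract, and the function name `bootstrap`, the `min_precision` threshold, and the toy sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def bootstrap(sentences, seeds, min_precision=0.5, max_rounds=5):
    """Generic pattern bootstrapping (illustrative, not StatSnowball itself).

    sentences: list of (entity1, pattern, entity2) triples, where pattern
               is the text connecting the two entities.
    seeds:     set of (entity1, entity2) pairs known to hold the relation.
    """
    known = set(seeds)
    patterns = set()
    for _ in range(max_rounds):
        hits = defaultdict(int)   # occurrences of a pattern with a known pair
        total = defaultdict(int)  # all occurrences of a pattern
        for e1, pat, e2 in sentences:
            total[pat] += 1
            if (e1, e2) in known:
                hits[pat] += 1
        # Keep patterns whose share of known pairs is high enough.
        # (Assumed stand-in for statistical pattern selection.)
        patterns = {p for p in total if hits[p] / total[p] >= min_precision}
        # Extract pairs matched by the selected patterns.
        new_pairs = {(e1, e2) for e1, pat, e2 in sentences
                     if pat in patterns} - known
        if not new_pairs:
            break  # nothing new was learned, so stop iterating
        known |= new_pairs
    return known, patterns


# Hypothetical toy corpus for a "spouse" relation.
sentences = [
    ("Alice", "is married to", "Bob"),
    ("Carol", "is married to", "Dave"),
    ("Carol", "and her husband", "Dave"),
    ("Erin", "and her husband", "Frank"),
    ("Alice", "met", "Grace"),
    ("Heidi", "met", "Ivan"),
]
known, patterns = bootstrap(sentences, seeds={("Alice", "Bob")})
print(sorted(known))
print(sorted(patterns))
```

In this toy run a single seed pair is enough to pick up a second pattern in a later round, which is the "snowball" effect. A loose threshold also lets weak patterns in, which is one reason principled pattern evaluation matters in such frameworks.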
Keywords/Search Tags: knowledge mining, named entity search, self-supervised learning, relationship extraction, named entity recognition, named entity summarization