The Research On Topic Extraction From Web Pages Based On Semantic

Posted on:2012-09-19

Degree:Master

Type:Thesis

Country:China

Candidate:W Zhang

GTID:2178330332495819

Subject:Computer application technology

Abstract/Summary:

When people search online, they often retrieve several pages about the same object that are related to one another through web links. Identifying and maintaining these closely associated pages is important, because it lets people find useful information accurately and move quickly to other related topics of interest. At present, such pages are usually maintained manually, so an intelligent recognition technology is needed.

This thesis focuses on topic extraction based on semantic information, introduced through an ontology library. It presents an algorithm that uses the WordNet ontology library to calculate the semantic similarity of texts extracted from the web, and then summarizes the semantic information of the related words, yielding more accurate topic extraction results. The method proceeds in the following steps.

First, information filtering removes navigation information, images, and advertising from the web page. This is an essential step: it keeps unrelated noise from affecting information extraction, so that the final result is accurate. After filtering, the page data is transformed into plain text, because clustering directly on the page form is difficult.

Second, the text is preprocessed. A coreference resolution system is introduced, which avoids unnecessarily high pronoun frequencies and improves the accuracy of information extraction. Lucene is used to analyze the sentence structure of the text, so that many noun phrases, such as names and proper nouns, are interpreted accurately. Lucene also creates an index for each word, which improves the efficiency and accuracy of information extraction.

Third, an improved WordNet-based Lesk similarity algorithm is used to calculate semantic similarity, and the pages are then clustered into classes. Pages with different topics fall into different clusters, which reduces the noise from pages with different themes. Moreover, computing the semantic similarity of themes, rather than simply calculating the inner product of keywords, retains the semantic information among web pages, which also improves the accuracy of information extraction.

Finally, an improved TF*IDF algorithm is used to extract information. The improvement makes up for a deficiency of the original algorithm, which does not take into account the distribution of features within and between classes.

According to the experimental data, the algorithm is feasible and effective.
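
For orientation, the baseline TF*IDF weighting that the final step starts from can be sketched as below. This is a minimal Python illustration of the standard formula only. The abstract does not specify the thesis's improved variant (the one that accounts for feature distribution within and between classes), so it is not reproduced here. The function names `tf_idf` and `top_terms` and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Standard weighting: tf = count / document length,
    idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(doc_weights, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, already-tokenized "pages" (hypothetical data).
docs = [
    "wordnet semantic similarity wordnet".split(),
    "web page topic extraction".split(),
    "semantic web page clustering".split(),
]
weights = tf_idf(docs)
print(top_terms(weights[0], k=2))
```

Note that a term occurring in every document receives weight zero under this formula, regardless of how it is distributed across classes. That insensitivity to class structure is the kind of limitation the abstract says the improved algorithm addresses.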

Keywords/Search Tags:

information extraction, WordNet, TF*IDF, Semantic Similarity

Related items

1	Research On Semantic Similarity Between Words And Between Short Texts Based On WordNet
2	Research And Application Of Wordnet-Based Semantic Similarity Measurement
3	Research Of Multi-Documents Summarization Based On Information Extraction And Semantic Similarity
4	The Research On Topic Extraction From Web Pages Based On Semantic
5	The Research Of Semantic Similarity Between Short Text Based On WordNet
6	Research On Key Technologies Of Ontology Construction Based On WordNet And Its Application In Security Domain
7	Research Of English Sentence Similarity Measure Based On Wordnet
8	Multiple Semantic-based Similarity And Relatedness Measurements In WordNet
9	Research On Method Of Semantic Similarity Based On Information Content
10	Research And Implementation Of Semantic Similarity Computing By Combining Knowledge-based And Corpus-based Methods