Research On Web Text Mining Based On XML

Posted on: 2011-12-10  Degree: Master  Type: Thesis
Country: China  Candidate: J W Yao  Full Text: PDF
GTID: 2178360305455227  Subject: Computer software and theory
Abstract/Summary:
Web mining is a key area of information processing that applies data mining technology to networks in order to analyse Web information. Because information on the Web is complex, diverse and unstructured, Web mining is commonly divided into three areas: Web content mining, Web structure mining and Web log mining. Web text mining is regarded as a difficult part of Web content mining because of the wide variety of forms that Web text takes. To make full use of Web information, especially Web text, for data collection and knowledge discovery, the paper studies the whole process of Web text mining and the key technologies it involves.

To obtain text resources from the Web, the paper uses meta-search engine technology instead of the traditional crawler-based method of document collection. Compared with undirected crawling, a meta-search engine can call multiple member search engines concurrently and obtain comprehensive, accurate resources quickly, which greatly improves the efficiency of resource acquisition. Based on an analysis of Google and Baidu, two of the most popular search engines in China, the paper provides a personalized collection tool for Web text resources, covering the corresponding search parameter settings and the identification of target URLs in the page source of the search results.

As a specification for data representation and storage, XML has been widely applied in Web and data mining for its portability, flexibility and formatting features. Compared with traditional HTML, which is mainly concerned with how information is displayed, XML emphasizes the separation of data from display. This gives a clearer structure for the information shown on the Web, more diverse tools for presenting it, and an easy way to update it dynamically. In Web text mining, HTML pages can be transformed into XML to extract useful information for Web page classification, and structured data in Web page tables can be extracted directly with XSL technology to implement text mining for a particular domain. The paper mainly extracts and stores the meaningful, continuous text in a Web page.

Web text is not only unstructured but also contains information unrelated to the text content, so pre-processing the data before mining is essential. Pre-processing includes extraction of the text content, cleaning of the text and word segmentation of the final text. Web page cleaning mainly transforms HTML pages into XHTML pages so that the HTML tags conform to XML syntax rules. Web text cleaning deletes the large amount of information that is very common and independent of the text content, which improves the speed and efficiency of the subsequent text extraction. As the key step of pre-processing, the accuracy of Chinese word segmentation directly affects the subsequent mining process. To improve segmentation accuracy, several segmentation methods need to be combined and the text scanned multiple times; this increases segmentation time but improves the accuracy of the patterns or rules found in the later stages of text classification, text clustering and association rule mining.

Word segmentation generates a large number of words. Applying them directly to classification not only increases processing time but may also reduce classification accuracy. Further processing is therefore needed to keep the terms that contribute most to the classification result and discard those that are less relevant. The paper studies the basic theory and implementation of document frequency, information gain, mutual information and the χ² (chi-square) method. All of these methods use the frequency with which a term occurs within and across documents as a measure, and then compute a numerical contribution value according to their respective formulas. The paper adopts the χ² method for feature measurement because of its relatively good performance.

Text classification is the core of the whole mining system. The paper introduces the classification methods in current use, including distance-based classification, naive Bayesian classification, decision tree classification and support vector machine classification. Each of these methods suits particular kinds of data. There are many improved algorithms based on these basic methods, and many of them target specific applications in specific areas. The main idea of this paper is, over the whole text mining process, to obtain the best possible pre-processing results before classification and to carry out repeated training experiments to find a suitable classification algorithm for the specific text at hand, and thereby obtain a more accurate classification model.

Based on research into Web information search, text mining and XML technology, the paper designs and implements a model of Web text mining. The model mainly uses text classification to analyse Web text sets. Its main functions are: collecting relevant Web documents, storing the text in XML format, adopting the vector space model (VSM) for document representation, adopting the χ² method for feature selection, using simple distance and kNN for text classification, and finally providing a visual user interface for showing the classification results. Evaluated on the training and test corpora, the accuracy and recall of the model are both about 85 percent. In addition, the paper gives a basic account of the theory and implementation of related technologies such as text summarization and text clustering, which helps to complete and improve the entire Web text mining system.
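
The feature selection and classification steps named in the abstract (χ² term scoring, VSM representation, kNN) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The toy corpus, the function names, the use of raw term counts as VSM weights and the choice of cosine similarity as the distance measure are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (segmented tokens, class label).
TRAIN = [
    (["xml", "schema", "tag"], "tech"),
    (["xml", "parser", "tag"], "tech"),
    (["goal", "match", "team"], "sport"),
    (["team", "score", "match"], "sport"),
]

def chi_square(term, label, docs):
    """Chi-square statistic of a term against a class, from document counts."""
    n = len(docs)
    a = b = c = 0
    for tokens, y in docs:
        present = term in tokens
        if present and y == label:
            a += 1          # term present, document in class
        elif present:
            b += 1          # term present, document in another class
        elif y == label:
            c += 1          # term absent, document in class
    d = n - a - b - c       # term absent, document in another class
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, k):
    """Keep the k terms with the highest chi-square score over all classes."""
    labels = {y for _, y in docs}
    vocab = {t for tokens, _ in docs for t in tokens}
    score = {t: max(chi_square(t, y, docs) for y in labels) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]

def vectorize(tokens, features):
    """VSM representation: one raw term count per selected feature."""
    counts = Counter(tokens)
    return [counts[f] for f in features]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def knn_classify(tokens, docs, features, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = vectorize(tokens, features)
    scored = sorted(
        ((cosine(query, vectorize(t, features)), y) for t, y in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(y for _, y in scored[:k])
    return votes.most_common(1)[0][0]

features = select_features(TRAIN, 4)
print(features)
print(knn_classify(["xml", "tag", "document"], TRAIN, features, k=3))
```

In this sketch each term is scored by its maximum χ² value over the classes; averaging over classes is another common choice, and the abstract does not say which the thesis uses.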
Keywords/Search Tags: Web text mining, text classification, XML, meta-search engine, feature selection