
Detection Of Sensitive Data From Big Data Using Classification Algorithms

Posted on: 2021-02-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate:
GTID: 1368330623477485
Subject: Computer application technology

Abstract/Summary:
Global terrorism has created challenges for the criminal justice system through abnormal activities that lead to financial loss, cyberwar, and cybercrime. Monitoring terrorist group activities by accurately mining criminal information from big data, in order to estimate potential risk at national and international levels, is therefore a global challenge. Many conventional computing methods have been successfully implemented, but little or no literature has been found that addresses these issues with big data analytical tools and techniques. To fill this gap, this research aims to determine accurate criminal data within a huge mass and variety of data using Hadoop clusters, supporting the Social Justice Organization in combating terrorist activities on a global scale. In this dissertation, several classification algorithms, such as Neural Networks, K-Nearest Neighbor, Term Frequency-Inverse Document Frequency, and Latent Semantic Analysis, have been successfully implemented to obtain significant results that create new ways of thinking for security agencies combating terrorism at the global scale.

Terrorism has gained global attention due to its negative impact on the economy and its effects worldwide. It also violates human rights and international law, so it cannot be tolerated or justified and must be fought and prevented at national and international levels. Terrorists try to achieve social or political objectives through violence, seeking the attention of large audiences beyond the immediate target and creating undue pressure on governments for personal or political gain. In this dissertation, sensitive data refers to data related to explosions, criminals, violence, burglaries, murder, thieves, and cybercriminals present in big data. Although the Study of Terrorism and Responses to Terrorism (START) deals with the causes and consequences of terrorism, observing terrorist activities to mine criminal information from big data, for the assessment of potential risk to societies, cities, and countries, remains a global challenge.

Big data concerns the creation, storage, and manipulation of data that come from diverse sources such as social media, online drives, sensors, transactions, cell phones, data stores, and the cloud. The collected data are of three types: unstructured, semi-structured, and structured; big data normally deals with unstructured data sets. Data are growing exponentially, and large volumes of varied data are accessed at high velocity by different kinds of users. If data and technologies are utilized properly, they yield significant results; on the contrary, if technology and information are misused, they pose a great problem for human beings and society. Many criminal activities are going on all around the world, and criminals share sensitive information using technology, so this information is present in the cloud and on social media in the form of big data. It is difficult for traditional statistical methods of computation to store, process, and analyze such a huge amount of data in a given time frame. In this scenario, big data analytical tools and techniques can support the identification of criminal activities.

There are many ways to determine sensitive data from big data; one important method is classification. Classification algorithms split a large volume of data into chunks to reduce its size. From the reduced data, it is easier to find the terms we need; the reduced form is simpler to store, analyze, and process, and intended terms can then be determined with different techniques. Machine learning has been successfully implemented in anti-terrorism activities to uncover hidden knowledge, and classification algorithms, together with big data analytical tools such as Hadoop and Spark, have been successfully implemented to reduce the huge mass of data to a smaller form and unmask hidden knowledge about criminal activities.

A novel approach to sensitive-data detection from big data is implemented through Term Frequency-Inverse Document Frequency (TF-IDF). Input data are stored in the Hadoop Distributed File System for parallelization. The Spark API is used to read the files; DataFrames, an optimized version of resilient distributed data sets, are suitable APIs for different machine learning algorithms. Stanford's NLP is used for lemmatization, which relies on annotations and annotators, and words that do not carry much meaning are removed with a stop-word remover. TF-IDF then computes the product of term frequency and inverse document frequency. A Singular Value Decomposition (SVD) algorithm has been implemented for data reduction so that the data can be analyzed quickly; finally, sensitive data are determined from the huge mass of data.

Latent Semantic Analysis (LSA) analyzes the relationship of terms to a set of documents. Big data is a collection of unstructured data sets in different formats, and it is difficult to process such a huge amount of data with traditional techniques and methods. LSA addresses the problems of polysemy and synonymy found in information retrieval. It uses SVD to segregate the terms and documents of a collection: a single matrix is decomposed into three matrices, where the first is the term matrix, the second is a diagonal matrix representing the strength of the terms in decreasing order, and the last represents the collection of documents. The many zero entries of the diagonal matrix are eliminated, and the newly obtained matrix is called the truncated SVD. After low-rank approximation and cosine similarity are applied, data related to criminal activities are determined.

A neural network solves the problem through interconnected processing elements working in parallel. The typical processing element is the perceptron, which takes many binary inputs and produces a single output determined by a threshold value and affected by the weights. Neural networks play a vital role in detecting sensitive data in big data. A single-layer network takes documents as input and produces sensitive data as output. In a multi-layer network, documents containing sensitive data are taken as input and processed through many hidden layers until no further sensitive data can be obtained; finally, a list of sensitive data is generated as output.

Another method for determining sensitive data from big data is the K-Nearest Neighbor (KNN) algorithm, an influential supervised learning method for classification problems. It takes documents containing sensitive data as input and, based on the distance to the nearest terms, provides a list of sensitive terms. A MapReduce architecture is implemented for determining sensitive data: in the map phase, the training data are passed in and distances are calculated, and the resulting data are fed to the reduce phase, where a key and a new HashMap are passed for processing, the similarity of terms is determined by the distance formula, and the closest terms are generated as output.

Other basic methods for the detection of sensitive data are TeraSort, SparkPi, and WordCount, which help reduce the size of the data in a specific order, turning highly voluminous data into low-sized data. TeraSort is a sorting technique used in a distributed computing environment with Hadoop tools; it sorts all the data in ascending order, and from the sorted list all the sensitive data can easily be determined. SparkPi uses Monte Carlo methods for the calculation of the value of Pi and segregates all the data into two parts, with non-sensitive data in one place and sensitive data in another. The WordCount algorithm generates the frequency of repeated terms, so that the volume of data is reduced and sensitive data can easily be determined from those terms.

Aiming at the problem of how to quickly and efficiently retrieve sensitive data from big data, we have carried out a great deal of work on big data, sensitive data, and machine learning algorithms. The main achievements of this work are summarized as follows:
· Big data analytical tools and techniques have been implemented to gain knowledge and understanding of crime, addressed and framed through classification algorithms to manage criminal activities; new ways of thinking have been created to support the criminal justice system in rectifying critical problems related to terrorism.
· A classification algorithm, the Back Propagation Neural Network model, has been employed in a distributed computing environment with Hadoop clusters to expose sensitive information. This model implements a map and a reduce function one after another; their combination exposes sensitive information quickly, reduces computation time, and performs comparatively better.
· An effective supervised machine learning algorithm, K-Nearest Neighbor, has been applied to determine the closeness of data in a distributed computing environment and effectively retrieve sensitive data. Using the distance formula, the distance between testing data and training data is calculated, and the value of k is chosen to determine the class of criminal information by majority vote.
· Latent Semantic Analysis has been proposed for the retrieval of sensitive data by decomposing the original document matrix into three matrices (a term matrix, a diagonal matrix, and a document matrix) using the YARN Resource Manager. This algorithm also implements map and reduce functions. Moreover, to reduce the size of the data, we have used Singular Value Decomposition; later, truncated SVD and cosine similarities are used to expose sensitive information.
· Many algorithmic approaches, including parallelization, annotators and annotations, lemmatization, stop-word removal, Term Frequency-Inverse Document Frequency, and Singular Value Decomposition, have been successfully implemented to determine accurate criminal data from a huge mass and variety of data using Hadoop clusters, supporting the Social Justice Organization in combating terrorist activities on a global scale.
· Other basic algorithms such as TeraSort, SparkPi, and WordCount have also been used to reduce the size of the data so that criminal information can be determined easily.
· The efficacy of the work is demonstrated through several experiments on real-world datasets, varying the number of nodes and the size of the data while keeping all other environmental conditions the same, retrieving sensitive information with stable performance and creating new ways of thinking for security agencies combating terrorism at the global scale.

Overall, this work helps to determine sensitive data from big data using classification algorithms, with the help of big data analytical tools and techniques, to investigate the nature, place, and patterns of crimes and the people involved before and after criminal activities, and to unearth the communication details of criminals during criminal activities. It also supports predictive analytics in unmasking sensitive data at the right time, so that unexpected future events can be controlled and criminal activities can be found even if criminals change their policies and move quickly.
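The TF-IDF weighting described above can be sketched in plain Python rather than the dissertation's Spark pipeline; the toy corpus and the unsmoothed log-idf formula are illustrative assumptions, not the dissertation's data or exact configuration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_index: {term: tf-idf score}} for a list of token lists."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        total = len(doc)
        # tf-idf = (term count / doc length) * log(N / document frequency)
        scores[i] = {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
    return scores

docs = [
    ["attack", "plan", "city"],
    ["weather", "plan", "city"],
    ["weather", "report", "city"],
]
scores = tf_idf(docs)
# "attack" occurs in only one document, so it outscores the common term "city"
```

A rare term such as "attack" receives a high weight while a term appearing in every document scores zero, which is exactly why TF-IDF surfaces unusual, potentially sensitive vocabulary from a large corpus.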
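The cosine-similarity comparison applied after the truncated SVD can be sketched as follows; the toy reduced-space vectors are illustrative assumptions, not vectors from the dissertation's documents.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = [1.0, 0.0, 1.0]   # reduced-space vector for a query term
doc_a = [2.0, 0.0, 2.0]   # points in the same direction as the query
doc_b = [0.0, 3.0, 0.0]   # orthogonal to the query
# doc_a is maximally similar (cosine 1.0); doc_b is unrelated (cosine 0.0)
```

Because cosine similarity depends only on direction, a long document and a short query about the same topic still score as similar in the reduced LSA space.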
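The k-nearest-neighbor classification step can be sketched as a minimal single-machine version of the distributed KNN described above; the tiny labelled points, the labels, and k=3 are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (vector, label) pairs; returns the majority label
    among the k training points nearest to the query vector."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((0.0, 0.0), "non-sensitive"),
    ((0.1, 0.2), "non-sensitive"),
    ((5.0, 5.0), "sensitive"),
    ((5.1, 4.9), "sensitive"),
    ((4.8, 5.2), "sensitive"),
]
label = knn_predict(train, (5.0, 4.8), k=3)
# the three nearest neighbours are all "sensitive", so that class wins the vote
```

In the MapReduce variant described in the abstract, the distance computation would run in the map phase over partitions of the training data, and the vote over the k nearest candidates would happen in the reduce phase.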
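The WordCount step can likewise be sketched as explicit map and reduce phases in plain Python; the input lines are illustrative, and a real Hadoop job would distribute both phases across the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # reduce: sum the emitted counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["explosion reported downtown", "explosion suspects fled"]
counts = reduce_phase(map_phase(lines))
# counts maps each word to its frequency across all lines
```

The resulting frequency table is far smaller than the original text, which is how WordCount reduces the data volume before sensitive terms are picked out.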
Keywords/Search Tags: Term frequency-inverse document frequency, Hadoop, Spark, Singular value decomposition, Latent semantic indexing, Neural networks, K-nearest neighbor, SparkPi