DOIONLINE

DOIONLINE NO - IJACEN-IRAJ-DOIONLNE-8556

Published In
International Journal of Advance Computational Engineering and Networking (IJACEN)
Issue
Volume-5, Issue-7 (Jul, 2017)
Paper Title
Fast Data Clustering and Outlier Detection using K-Means Clustering on Apache Spark
Author Name
Yadigar Erdem, Caner Ozcan
Affiliation
Department of Computer Engineering, Karabuk University, Karabuk, Turkey
Pages
86-90
Abstract
The components of the information society are now visible in every area of our lives. As computers have grown in importance, the information being collected has taken on meaningful and specific qualities. Both the amount of information and the speed of access to it have increased. Big Data is data gathered from different sources, such as social media posts, blogs, photos, videos, and log files, and transformed into a meaningful and workable form. Clustering Big Data with machine learning methods is very useful: clustering places similar data under the same group by partitioning the dataset into groups. Once the dataset has been partitioned, outlier detection is used to find fraudulent data. This study aims to make data clustering and outlier detection on Big Data faster by using Apache Spark with the K-means clustering method. Because clustering Big Data can be time consuming, the Apache Spark fast cluster computing architecture is used, with the goal of a fault-tolerant, reliable, consistent, and fast clustering process. MLlib, one of Spark's components, has a relatively small code size and is easy to use; its goal is to make practical machine learning scalable and useful. The K-means method included in MLlib, which is used in this study, provides a successful analysis of Big Data. The results are presented in tables and graphs using a sample dataset.
Index Terms: Apache Spark, Big Data, K-means Clustering, Outlier Detection.
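The clustering-then-outlier-detection idea described in the abstract can be illustrated with a minimal single-machine sketch. This is plain Python, not Spark MLlib, and it is not the authors' implementation. The abstract does not state the outlier criterion, so the rule used here (flag a point whose distance to its assigned centroid exceeds the mean distance plus `num_std` standard deviations) is an assumption, as are the function names `kmeans` and `find_outliers`, the first-k-points initialization, and the toy data.

```python
import math
import statistics


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means on tuples of numbers; returns (centroids, assignments)."""
    # Assumption: seed with the first k points (a simplification for determinism).
    centroids = list(points[:k])

    def nearest(p):
        # Index of the centroid closest to point p (Euclidean distance).
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [nearest(p) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, [nearest(p) for p in points]


def find_outliers(points, centroids, assignments, num_std=2.0):
    """Flag points unusually far from their own centroid (assumed rule, see lead-in)."""
    distances = [math.dist(p, centroids[a]) for p, a in zip(points, assignments)]
    threshold = statistics.mean(distances) + num_std * statistics.pstdev(distances)
    return [p for p, d in zip(points, distances) if d > threshold]


# Hypothetical toy data: two tight groups plus one distant point.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (1, 1),
          (10, 11), (11, 10), (11, 11), (20, 20)]
centroids, assignments = kmeans(points, k=2)
outliers = find_outliers(points, centroids, assignments)
print(outliers)  # [(20, 20)]
```

On Spark, the assignment and update steps are distributed across the cluster, which is where the speed-up the abstract aims for would come from; the distance-to-centroid check afterwards is a per-point operation that parallelizes in the same way.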