Data Spillage Remediation Techniques in Hadoop

Location

Huntsville (Ala.)

Start Date

6-7-2017

Presentation Type

Paper

Description

Hadoop implements its own file system, the Hadoop Distributed File System (HDFS), which is designed to run on commodity hardware to store and process large data sets. Data Spillage is a condition in which a data set of higher classification is accidentally stored on a system of lower classification. When deleted, the spilled data remains forensically retrievable because file systems typically implement deletion by merely resetting pointers and marking the corresponding space as available. The problem is compounded in Hadoop, which by design creates and stores multiple copies of the same data to ensure high availability, thereby increasing the risk to data confidentiality. This paper proposes three approaches to eliminate this risk. In the first approach, the spilled data is securely overwritten multiple times with zero and random fills at the OS level to render it forensically irretrievable. In the second approach, Hadoop's built-in delete function is enhanced to implement a secure deletion mechanism. In the third approach, the hard drives of the data nodes holding spilled data are replaced with new drives and the old drives are destroyed. The paper also evaluates all three approaches to arrive at an optimal solution that can be implemented in a large-scale production environment.
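The abstract only outlines the three approaches; the following is a minimal, hypothetical sketch (not taken from the paper) of what the first, OS-level overwrite approach could look like on a single data node. It assumes the spilled HDFS file has already been mapped to its block IDs (for example with hdfs fsck <path> -files -blocks -locations) and that the corresponding blk_* block and .meta checksum files have been located under the data node's dfs.datanode.data.dir; the file paths shown are invented for illustration.

# Illustrative sketch of the OS-level secure-overwrite approach (assumptions noted above).
import os

def secure_overwrite(path: str, passes: int = 3, chunk_size: int = 1 << 20) -> None:
    """Overwrite a file in place with random fills, then a zero fill, then remove it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(os.urandom(n))      # random fill pass
                remaining -= n
            f.flush()
            os.fsync(f.fileno())            # force this pass to disk before the next one
        f.seek(0)
        remaining = size
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(b"\x00" * n)            # final zero fill pass
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)                         # delete only after the contents are overwritten

if __name__ == "__main__":
    # Hypothetical block and checksum files belonging to the spilled HDFS file,
    # located under the data node's local storage directory (dfs.datanode.data.dir).
    for block_file in [
        "/data/dfs/dn/current/BP-1/current/finalized/subdir0/blk_1073741825",
        "/data/dfs/dn/current/BP-1/current/finalized/subdir0/blk_1073741825_1001.meta",
    ]:
        secure_overwrite(block_file)

Because HDFS replicates each block, the same procedure would have to be repeated on every data node that holds a replica of the spilled blocks, which is part of what the paper's evaluation of the three approaches weighs.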


 
