Event Title
Data Spillage Remediation Techniques in Hadoop
Location
Huntsville (Ala.)
Start Date
6-7-2017
Presentation Type
Paper
Description
Hadoop implements its own file system, HDFS (Hadoop Distributed File System), designed to run on commodity hardware to store and process large data sets. Data spillage is a condition in which a data set of higher classification is accidentally stored on a system of lower classification. When deleted, the spilled data remains forensically retrievable because file systems typically implement deletion by merely resetting pointers and marking the corresponding space as available. The problem is compounded in Hadoop, which is designed to create and store multiple copies of the same data to ensure high availability, thereby increasing the risk to data confidentiality. This paper proposes three approaches to eliminate that risk. In the first approach, the spilled data is securely overwritten multiple times with zero and random fills at the OS level, rendering it forensically irretrievable. In the second approach, Hadoop's built-in delete function is enhanced to implement a secure deletion mechanism. In the third approach, the hard drives of the data nodes holding spilled data are replaced with new ones and the old drives are destroyed. The paper also evaluates all three approaches to arrive at an optimal solution that is implementable in a large-scale production environment.
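The first approach lends itself to a short illustration. The sketch below is not taken from the paper; it shows, under stated assumptions, how a spilled HDFS block replica file on a data node's local disk might be overwritten at the OS level with zero and random fills before normal deletion. The pass sequence, buffer size, and command-line usage are illustrative assumptions, and locating the replica paths for a spilled file (for example with HDFS's fsck tooling, which can report block locations) is left out.

import java.io.RandomAccessFile;
import java.security.SecureRandom;

// Sketch of approach one: multi-pass overwrite of a spilled HDFS block
// replica file on a data node's local disk, prior to normal deletion.
// The pass sequence (zero, random, zero) is an illustrative assumption.
public class SecureOverwrite {

    private static final int BUFFER_SIZE = 64 * 1024;

    // One overwrite pass: fill the whole file with zeros or random bytes.
    static void overwritePass(RandomAccessFile raf, boolean randomFill) throws Exception {
        SecureRandom rng = new SecureRandom();
        byte[] buffer = new byte[BUFFER_SIZE];   // zero-filled by default
        long length = raf.length();
        raf.seek(0);
        long written = 0;
        while (written < length) {
            if (randomFill) {
                rng.nextBytes(buffer);
            }
            int chunk = (int) Math.min(BUFFER_SIZE, length - written);
            raf.write(buffer, 0, chunk);
            written += chunk;
        }
        raf.getFD().sync();   // flush this pass to disk before the next one
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SecureOverwrite <path-to-block-replica-file>");
            return;
        }
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw")) {
            overwritePass(raf, false);  // pass 1: zero fill
            overwritePass(raf, true);   // pass 2: random fill
            overwritePass(raf, false);  // pass 3: zero fill
        }
        // After the passes complete, the replica file can be deleted normally.
    }
}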
Recommended Citation
Jantali, Srinivas and Mani, Sunanda, "Data Spillage Remediation Techniques in Hadoop" (2017). National Cyber Summit. 6.
https://louis.uah.edu/cyber-summit/ncs2017/ncs2017papers/6