Date of Award

2025

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Committee Chair

Tathagata Mukherjee

Committee Member

Chaity Banerjee

Committee Member

Aaron Kaulfus

Research Advisor

Tathagata Mukherjee

Subject(s)

Cloud computing, Virtual storage (Computer science), Time-series analysis, Forecasting

Abstract

Large-scale cloud storage systems managing petabyte-scale data face cost challenges due to sparse, irregular file access patterns. Traditional attribute-based methods often fail to capture dynamic temporal and event-driven behaviors. This thesis compares classical statistical and deep learning approaches for predicting file access patterns using five months of real-world access logs from Tsinghua University’s FTP server (2.9 million events). Five forecasting models—ARIMA, SARIMA, Exponential Smoothing, Prophet, and 1D CNNs—are evaluated across 1,161 files. A novel hybrid clustering method combines time series similarity with forecasting accuracy metrics. Results show ARIMA significantly outperforms deep learning (2.3× better accuracy: 0.0188 vs. 0.036 MAE) for hourly forecasts, particularly for high-frequency files. Clustering achieves strong separation (silhouette score: 0.889), identifying four behavioral patterns that support targeted forecasting strategies with 40–60% improved accuracy. These insights enable cluster-specific intelligent data tiering, reducing storage costs by 30–50% while maintaining data accessibility.
