Date of Award
Doctor of Philosophy (PhD)
Semantics--Data processing, Data structures (Computer Science), Time-series analysis--Data processing
Semantic deduplication is the process of identifying records that describe, or are derivative of, the same entity; it is a crucial task when merging data pulled at different times or from multiple sources. Failure to resolve these duplicates may bias the data and the applications and models that use them. Duplicates range from exact matches to more interesting cases where the data differ due to syntactic representation, missing or updated information, or adversarial transforms. The implications of duplication extend beyond redundancy to include recognizing critical information updates, detecting errors, and identifying similar behaviors. The task is particularly difficult in temporal domains with time-series data: existing techniques focus on domain-specific information, rely on distance or similarity measures, and, in temporal domains, process records with only a single timestamp. This dissertation proposes an approach that couples a base deduplication technique with one focused on temporal sequence knowledge discovery. The concepts explored include similarity; a generalized rule set focused on key, non-key, and elapsed-time equivalencies; discovery of implied temporal event-order constraints using subsequence inference and probabilistic determination; and time-weighted graph-based representations. In total, 12 methods were evaluated against three labeled datasets: the Accumulative Adaptive Sorted Neighborhood Method (ASNM) with Jaccard similarity as a baseline comparison, four proposed methods, and seven hybrid/ensemble methods. The datasets comprise PlayStation Network (PSN) trophy data and two COVID-19 policy datasets. Applied to PSN, the implications of duplication for cheat detection are shown, while application to the COVID-19 policies reveals variability in state responses to the pandemic and to the associated federal guidelines.
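For readers unfamiliar with the baseline metric, the Jaccard similarity used by the ASNM comparison can be sketched as follows. This is a minimal illustration, not code from the dissertation; the record tokens are hypothetical and the set-based record representation is an assumption.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets score 1.0 by convention."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical tokenized records (illustrative only; not drawn from the PSN
# or COVID-19 datasets). A near-duplicate differing only in one token's
# syntactic representation still scores well below 1.0.
r1 = {"alice", "2020-03-15", "stay-home"}
r2 = {"alice", "2020-03-15", "stay_home"}
print(jaccard(r1, r2))  # → 0.5
```

A purely set-based score like this treats `stay-home` and `stay_home` as unrelated tokens, which illustrates why syntactic-representation duplicates are hard for similarity-only baselines.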
All methods based on the proposed approach yielded MCC scores in excess of 0.9, far exceeding the 0.738 maximum achieved by ASNM. The highest-scoring methods combined a generalized rule set with temporal order constraint discovery over a set of like records using longest common subsequence or probabilistic determinations. These methods yielded an MCC of 0.981 on the PSN dataset and perfect models on both COVID-19 datasets.
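The longest-common-subsequence component mentioned above can be illustrated with the classic dynamic-programming algorithm applied to event sequences. This is a generic sketch, not the dissertation's implementation: the event names and the idea of extracting a shared order from two records' event sequences are illustrative assumptions.

```python
def lcs(xs: list, ys: list) -> list:
    """Return one longest common subsequence of xs and ys via dynamic programming."""
    m, n = len(xs), len(ys)
    # dp[i][j] = LCS length of xs[:i] and ys[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xs[i] == ys[j] else max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if xs[i - 1] == ys[j - 1]:
            out.append(xs[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Hypothetical event sequences from two records of the same entity: the
# shared subsequence suggests an implied event-order constraint.
seq_a = ["enroll", "update", "close"]
seq_b = ["enroll", "rename", "update", "close"]
print(lcs(seq_a, seq_b))  # → ['enroll', 'update', 'close']
```

The intuition is that events common to both sequences, taken in their shared order, hint at an ordering constraint that like records of the same entity should obey.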
Rogers, Jon C., "Semantic deduplication of redundant and non-conformant data in temporal domains" (2023). Dissertations. 277.