Author

Jon C. Rogers

Date of Award

2023

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

Committee Chair

Letha Etzkorn

Committee Member

Ramazan Aygun

Committee Member

Harry Delugach

Committee Member

Huaming Zhang

Committee Member

Sampson Gholston

Subject(s)

Semantics--Data processing, Data structures (Computer Science), Time-series analysis--Data processing

Abstract

Semantic deduplication is the process of identifying records that describe or are derivative of the same entity and is a crucial task when merging data pulled at different times and/or from multiple sources. Failure to resolve these duplicates may bias the data and the applications and models that use them. Duplicates range from exact matches to more interesting cases where the data differ due to syntactic representation, missing or updated information, or adversarial transforms. The implications of duplication go beyond redundancy to include recognizing critical information updates, detecting errors, and identifying similar behaviors. This task is particularly difficult for temporal domains with time-series data. Existing techniques focus on domain-specific information, rely on distance or similarity approaches, and, in temporal domains, process records with only a single timestamp. This dissertation proposes an approach that applies a base deduplication technique coupled with one focused on temporal sequence knowledge discovery. The concepts explored include similarity, a generalized rule set focused on key, non-key, and elapsed-time equivalencies, discovery of implied temporal event order constraints using subsequence inference and probabilistic determination, and time-weighted graph-based representations. In total, 12 methods were evaluated against three labeled datasets: the Accumulative Adaptive Sorted Neighborhood Method (ASNM) using Jaccard similarity as a baseline comparison, four proposed methods, and seven hybrid/ensemble methods. The datasets include PlayStation Network (PSN) trophy data and two COVID-19 policy datasets. Applied to PSN, the implications of duplication for cheat detection are shown, while application to COVID-19 policies identifies variability in state responses to the pandemic and associated federal guidelines.
All methods based on the proposed approach yielded MCC scores in excess of 0.9, far exceeding the 0.738 maximum performance of ASNM. The methods that scored highest combined a generalized rule set with temporal order constraint discovery for a set of like records using longest common subsequence or probabilistic determinations. These methods yielded an MCC of 0.981 for the PSN dataset and perfect models against both COVID-19 datasets.
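The building blocks named in the abstract — Jaccard similarity for the ASNM baseline, longest common subsequence for temporal order constraints, and the Matthews correlation coefficient (MCC) used for scoring — can be sketched as below. This is a minimal illustration of those standard measures, not the dissertation's implementation; all function names are illustrative.

```python
from math import sqrt

def jaccard(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def lcs_len(x, y):
    """Length of the longest common subsequence of two event sequences,
    via the classic dynamic-programming table."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a binary confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, two policy records sharing two of four distinct tokens have a Jaccard similarity of 0.5, and two trophy-event sequences can be compared by `lcs_len` to test whether one respects the event order implied by the other.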
