Date of Award
Doctor of Philosophy (PhD)
Semantics--Data processing, Data structures (Computer Science), Time-series analysis--Data processing
Semantic deduplication is the process of identifying records that describe, or are derivative of, the same entity; it is a crucial task when merging data pulled at different times or from multiple sources. Failure to resolve these duplicates may bias the data and the applications and models that use them. Duplicates range from exact matches to more interesting cases where the data differ due to syntactic representation, missing or updated information, or adversarial transforms. The implications of duplication extend beyond redundancy to include recognizing critical information updates, detecting errors, and identifying similar behaviors. The task is particularly difficult in temporal domains with time-series data: existing techniques focus on domain-specific information, rely on distance or similarity measures, and, in temporal domains, process records with only a single timestamp. This dissertation proposes an approach that couples a base deduplication technique with one focused on temporal sequence knowledge discovery. The concepts explored include similarity; a generalized rule set focused on key, non-key, and elapsed-time equivalencies; discovery of implied temporal event-order constraints using subsequence inference and probabilistic determination; and time-weighted graph-based representations. In total, 12 methods were evaluated against three labeled datasets: the Accumulative Adaptive Sorted Neighborhood Method (ASNM) with Jaccard similarity as a baseline comparison, four proposed methods, and seven hybrid/ensemble methods. The datasets comprise PlayStation Network (PSN) trophy data and two COVID-19 policy datasets. Applied to PSN, the implications of duplication for cheat detection are shown, while application to the COVID-19 policies reveals variability in state responses to the pandemic and to the associated federal guidelines.
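For readers unfamiliar with the baseline metric, the Jaccard similarity used by the ASNM comparison can be sketched as follows. This is a minimal illustration, not code from the dissertation; the record tokens are hypothetical and the set-based record representation is an assumption.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets score 1.0 by convention."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical tokenized records (illustrative only; not drawn from the PSN
# or COVID-19 datasets). A near-duplicate differing only in one token's
# syntactic representation still scores well below 1.0.
r1 = {"alice", "2020-03-15", "stay-home"}
r2 = {"alice", "2020-03-15", "stay_home"}
print(jaccard(r1, r2))  # → 0.5
```

A purely set-based score like this treats `stay-home` and `stay_home` as unrelated tokens, which illustrates why syntactic-representation duplicates are hard for similarity-only baselines.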
All methods based on the proposed approach yielded MCC scores in excess of 0.9, far exceeding the 0.738 maximum achieved by ASNM. The highest-scoring methods combined a generalized rule set with temporal order constraint discovery over a set of like records using longest common subsequence or probabilistic determinations. These methods yielded an MCC of 0.981 on the PSN dataset and perfect models on both COVID-19 datasets.
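The longest-common-subsequence component mentioned above can be illustrated with the classic dynamic-programming algorithm applied to event sequences. This is a generic sketch, not the dissertation's implementation: the event names and the idea of extracting a shared order from two records' event sequences are illustrative assumptions.

```python
def lcs(xs: list, ys: list) -> list:
    """Return one longest common subsequence of xs and ys via dynamic programming."""
    m, n = len(xs), len(ys)
    # dp[i][j] = LCS length of xs[:i] and ys[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xs[i] == ys[j] else max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if xs[i - 1] == ys[j - 1]:
            out.append(xs[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

# Hypothetical event sequences from two records of the same entity: the
# shared subsequence suggests an implied event-order constraint.
seq_a = ["enroll", "update", "close"]
seq_b = ["enroll", "rename", "update", "close"]
print(lcs(seq_a, seq_b))  # → ['enroll', 'update', 'close']
```

The intuition is that events common to both sequences, taken in their shared order, hint at an ordering constraint that like records of the same entity should obey.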
Rogers, Jon C., "Semantic deduplication of redundant and non-conformant data in temporal domains" (2023). Dissertations. 277.