Causal Knowledge in Data Fusion Subject to Latent Confounding and Measurement Error
Jingyi Yu, Tim Pychynski, Marco F. Huber
TLDR
The machine learning-based fusion strategy achieves the best prediction quality when data are independent and identically distributed (i.i.d.), whereas in the presence of latent confounding the causality-based fusion strategy makes prediction models more robust against severe distribution shifts.
Abstract
Data fusion is the process of integrating data from multiple sources to produce more accurate and reliable information. In real-world scenarios, data are often subject to latent confounding and measurement error. In this paper, we evaluate fusion strategies that incorporate different levels of causal knowledge on a quality prediction task under varied conditions of latent confounding and measurement error. We show that the machine learning-based fusion strategy achieves the best prediction quality when data are independent and identically distributed (i.i.d.). However, in the presence of latent confounding, the causality-based fusion strategy makes prediction models more robust against severe distribution shifts. Moreover, the out-of-distribution (OOD) generalizability of prediction models is also affected by measurement error in the data. If causal knowledge must be inferred from data by applying causal discovery methods, we demonstrate that measurement error can impair causal discovery. We therefore advocate caution when using standard causal discovery methods if the circumstances under which the data were generated are unknown.
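
The effect of measurement error on causal discovery can be illustrated with a minimal sketch. It is not taken from the paper's experiments: the linear-Gaussian chain X -> Y -> Z, the unit noise levels, and the helper names `partial_corr` and `simulate` are all assumptions made for illustration. In such a chain X and Z are independent given the true Y, which is the kind of constraint many standard causal discovery methods test for. When only a noisy measurement of Y is available, the partial correlation no longer vanishes, so a conditional independence test would tend to keep a spurious X-Z dependence.

```python
import numpy as np

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing c out of each."""
    design = np.column_stack([np.ones_like(c), c])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])

def simulate(noise_std, n=20000, seed=0):
    """Partial correlation of X and Z given a measured Y in the chain X -> Y -> Z.

    noise_std is the standard deviation of the additive measurement error
    on Y (an illustrative parameter, not a value from the paper).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + rng.normal(size=n)                   # true mediator
    z = y + rng.normal(size=n)                   # depends on X only through Y
    y_obs = y + noise_std * rng.normal(size=n)   # what the sensor records
    return partial_corr(x, z, y_obs)

clean = simulate(noise_std=0.0)  # conditioning on the true Y
noisy = simulate(noise_std=1.0)  # conditioning on the noisy measurement

print(f"partial corr(X, Z | Y), no measurement error:   {clean:.3f}")
print(f"partial corr(X, Z | Y), with measurement error: {noisy:.3f}")
```

With these assumed variances, the population partial correlation given the noisy measurement is about 0.32 rather than 0, so the conditional independence that identifies the chain is no longer visible in the observed data.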
