WG-P1 How poor data hygiene leads towards false AI models in biomedicine

Name of PI	Dr Wilson Goh
Email Address	wilsongoh@ntu.edu.sg
Project Title	How poor data hygiene leads towards false AI models in biomedicine
Description	A mistrained AI model will not save the world, and it can do considerable harm. In the biomedical setting, it may lead to wrong diagnoses, wrong prognoses, or even wrong treatments, with potentially disastrous consequences. In this study, you will examine three malpractices that are rampant in current biomedical AI practice. Using data drawn from selected publications, you will demonstrate the biases associated with each malpractice: (1) False normalization (also known as test set bias), where class imbalance effects are locked in and cannot be removed afterwards. (2) Data leakage, where a single sample is sectioned into multiple samples that end up in both the training and test sets. (3) Feature substitutability, where the AI does well regardless of how it is trained or what it is trained with. You will demonstrate that once these issues are resolved, the high reported prediction performance disappears. We will use the insights from this project to inform the community on how to develop better and more rigorous models.
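
To make malpractice (2) concrete, the following is a minimal sketch of one possible way to keep all sections of the same specimen on one side of a train/test split. It is an illustration only, not the project's prescribed method. The function names `group_split` and `leaked_sources`, the sample naming scheme, and the 25% test fraction are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def group_split(sample_ids, source_ids, test_fraction=0.25, seed=0):
    """Split samples so that all sections of one source specimen stay together.

    sample_ids: one identifier per sectioned sample.
    source_ids: the specimen each sample was cut from (same order and length).
    """
    if len(sample_ids) != len(source_ids):
        raise ValueError("sample_ids and source_ids must have the same length")

    # Collect the samples that belong to each source specimen.
    groups = defaultdict(list)
    for sample, source in zip(sample_ids, source_ids):
        groups[source].append(sample)

    # Shuffle whole specimens, not individual sections.
    sources = sorted(groups)
    random.Random(seed).shuffle(sources)
    n_test = max(1, round(len(sources) * test_fraction))

    test = [s for src in sources[:n_test] for s in groups[src]]
    train = [s for src in sources[n_test:] for s in groups[src]]
    return train, test


def leaked_sources(train, test, source_of):
    """Return the specimens that contribute samples to both sets."""
    return {source_of[s] for s in train} & {source_of[s] for s in test}


# Hypothetical data: 4 specimens, each sectioned into 3 samples.
source_of = {f"P{p}_sec{k}": f"P{p}" for p in range(1, 5) for k in range(1, 4)}
samples = list(source_of)

train, test = group_split(samples, [source_of[s] for s in samples])
print("specimens shared by train and test:", leaked_sources(train, test, source_of))
```

A split that shuffles individual sections instead of whole specimens would typically place sections of the same specimen in both sets, which is the leakage described in (2). Checking `leaked_sources` on an existing split is a quick way to test for it.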