Every clinical researcher who has wrestled with messy electronic health records — missing labs, conflicting codes, inconsistent units — knows the preprocessing bottleneck can dwarf the actual analysis. A direct performance benchmark of three leading AI language models against traditional expert-built pipelines now offers a concrete answer to whether that bottleneck can be automated away without sacrificing predictive quality.
Three major large language models — GPT-4, Claude 3.5 Sonnet, and Gemini 1.5 Pro — were tasked with generating preprocessing scripts across five data cleaning and two feature engineering challenges on three well-established clinical datasets: MIMIC-IV, the eICU Collaborative Research Database, and NHANES 2017–2020. Performance was measured on both the quality of the cleaning itself and the downstream predictive power for in-hospital mortality using XGBoost and logistic regression, quantified by AUROC and F1 score. Claude 3.5 Sonnet led data cleaning with a mean F1 of 0.90, GPT-4 followed at 0.89, and Gemini 1.5 Pro reached 0.85 — all surpassing the conventional rule-based pipeline benchmark.
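For readers less familiar with the two headline metrics, the following is a minimal sketch of how F1 and AUROC can be computed from binary mortality labels and model risk scores. It is not the study's evaluation code: the function names, the toy labels, the scores, and the 0.5 decision threshold are all illustrative assumptions.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed): 1 = died in hospital; scores are hypothetical model outputs.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("AUROC:", auroc(y_true, scores))
print("F1:", f1_score(y_true, y_pred))
```

On this toy input, AUROC works out to 8/9 (eight of the nine positive-negative pairs are ranked correctly) and F1 to 0.8. The example also shows why the two metrics can diverge: AUROC depends only on how the scores rank patients, while F1 depends on the chosen threshold.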
This finding deserves careful context. Rule-based pipelines, while labor-intensive, are deterministic and auditable, two qualities that regulatory and clinical environments demand. The fact that LLM-generated scripts outscored the rule-based baseline on F1 does not automatically make them deployment-ready: hallucinated imputations or plausible-but-wrong clinical coding decisions could introduce subtle biases that downstream metrics never reveal. The study's reliance on publicly available, relatively well-curated datasets also means that performance on messier institutional EHRs remains an open question. Still, for research contexts where iterative preprocessing is the norm, the benchmark is a useful signal: AI-assisted data wrangling may cut weeks of analyst time while preserving, and potentially improving, model performance. The field should now push toward interpretability and auditability standards before clinical deployment becomes routine.
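To make "deterministic and auditable" concrete, here is a minimal sketch of a rule-based unit-normalization step that records every decision it makes. The rule table, the record layout, and the function name are hypothetical and do not come from the benchmark; the glucose conversion factor (about 18.016 mg/dL per mmol/L) follows from glucose's molar mass of roughly 180.16 g/mol.

```python
# Hypothetical rule table: (lab name, source unit) -> multiplier to mg/dL.
UNIT_RULES = {
    ("glucose", "mg/dL"): 1.0,
    ("glucose", "mmol/L"): 18.016,  # approximate factor from molar mass
}


def normalize_labs(records):
    """Apply fixed conversion rules; return cleaned rows plus an audit trail."""
    cleaned, audit = [], []
    for i, rec in enumerate(records):
        key = (rec["lab"], rec["unit"])
        if rec["value"] is None:
            # Missing values are logged and skipped, never silently imputed.
            audit.append((i, "skipped: missing value"))
            continue
        if key not in UNIT_RULES:
            audit.append((i, f"skipped: no rule for {key}"))
            continue
        factor = UNIT_RULES[key]
        cleaned.append({
            "lab": rec["lab"],
            "value": round(rec["value"] * factor, 1),
            "unit": "mg/dL",  # every rule in this sketch targets mg/dL
        })
        audit.append((i, f"converted with factor {factor}"))
    return cleaned, audit


# Toy records (assumed layout) with mixed units, a missing value, and an unknown unit.
records = [
    {"lab": "glucose", "value": 5.5, "unit": "mmol/L"},
    {"lab": "glucose", "value": 99.0, "unit": "mg/dL"},
    {"lab": "glucose", "value": None, "unit": "mg/dL"},
    {"lab": "glucose", "value": 90.0, "unit": "g/L"},
]
cleaned, audit = normalize_labs(records)
for row_index, decision in audit:
    print(row_index, decision)
```

The same input always yields the same output and the same log, so a reviewer can trace each row to the rule that touched it. An LLM-generated script can offer this property too, but only if the generated code is frozen, reviewed, and versioned like any other pipeline.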