Shift Bioscience refines metric calibration to improve AI virtual cell performance
Shift Bioscience has published new research presenting an improved framework for evaluating metric calibration in AI virtual cell models, showing that better-calibrated metrics reveal consistent outperformance over traditional baselines.
The Cambridge-based biotech, which applies machine learning to understand cell rejuvenation, reported that instances of poor model performance in earlier studies are often the result of miscalibrated evaluation metrics rather than deficiencies in the models themselves. The work challenges recent claims that AI virtual cells perform no better than uninformative baselines, such as mean or linear models.
The study focused on genetic perturbation response models, a subset of virtual cell systems designed to predict cellular reactions to gene up- or down-regulation. Such models are increasingly used to accelerate target identification in drug discovery by reducing the need for wet-lab experiments.
Using 14 publicly available perturb-seq datasets, Shift Bioscience’s researchers showed that widely used evaluation metrics often fail to distinguish robust predictions from noise, particularly in weaker perturbation datasets. They then developed a refined framework introducing rank-based and differentially expressed gene (DEG)-aware metrics that more accurately capture model performance.
When evaluated under this improved framework, virtual cell models consistently outperformed standard baselines, providing evidence that well-calibrated metrics are essential to reveal meaningful biological signal. According to the researchers, this approach enables clearer differentiation between informative and uninformative predictions, improving the reliability of AI-driven target discovery.
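The effect described above can be illustrated with a small synthetic sketch. It is not Shift Bioscience's published framework: the data are simulated, and the function names (`deg_mask`, `mse_on_degs`, `rank_score`), the top-k definition of DEGs and the noise levels are all assumptions chosen for illustration. The sketch compares a mean baseline with a noisy but informative model on an all-gene error, a DEG-restricted error and a rank-based retrieval score.

```python
import numpy as np

# Hypothetical toy data: 20 perturbations x 200 genes, each perturbation
# shifting only 5 genes (a "weak" perturbation regime). All values are simulated.
rng = np.random.default_rng(0)
n_perts, n_genes, n_degs = 20, 200, 5
true = np.zeros((n_perts, n_genes))
for i in range(n_perts):
    true[i, rng.choice(n_genes, size=n_degs, replace=False)] = 2.0

# Mean baseline: the same average profile predicted for every perturbation.
baseline_pred = np.tile(true.mean(axis=0), (n_perts, 1))
# Informative model: the true response plus noise on every gene.
model_pred = true + rng.normal(0.0, 0.35, size=true.shape)


def mse_all_genes(pred, truth):
    """Mean squared error over every gene, changed or not."""
    return float(np.mean((pred - truth) ** 2))


def deg_mask(truth, k):
    """Mark the k genes with the largest absolute true change per perturbation."""
    idx = np.argsort(-np.abs(truth), axis=1)[:, :k]
    mask = np.zeros(truth.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask


def mse_on_degs(pred, truth, k):
    """Mean squared error restricted to the top-k changed genes (assumed DEG proxy)."""
    mask = deg_mask(truth, k)
    return float(np.mean((pred[mask] - truth[mask]) ** 2))


def rank_score(pred, truth):
    """Mean normalized rank of the matching true profile among all true profiles.

    1.0 means the prediction is always closest to its own perturbation;
    0.5 is the chance level. Ties count as half a rank.
    """
    dist = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=2)
    correct = np.diag(dist)[:, None]
    closer = (dist < correct).sum(axis=1)
    ties = (dist == correct).sum(axis=1) - 1  # exclude the match itself
    rank = closer + 0.5 * ties
    return float(np.mean(1.0 - rank / (dist.shape[0] - 1)))


for name, pred in [("mean baseline", baseline_pred), ("model", model_pred)]:
    print(
        name,
        "| all-gene MSE:", round(mse_all_genes(pred, true), 3),
        "| DEG MSE:", round(mse_on_degs(pred, true, n_degs), 3),
        "| rank score:", round(rank_score(pred, true), 3),
    )
```

In this toy setting the all-gene error is dominated by the many unchanged genes, so the mean baseline can look as good as or better than the informative model, while the DEG-restricted error and the rank score separate the two. A constant prediction scores exactly at chance on the rank score by construction, which is the kind of calibration property the article describes.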
Henry Miller, head of machine learning at Shift Bioscience, said: “This latest research from our talented team provides clear evidence that the reports of poor performance in AI virtual cells are largely due to limitations of metrics, not due to issues with the models. We showed that when models are evaluated on well-calibrated metrics, they perform quite well and consistently outperform key baselines.”
He added that the findings could broaden adoption of AI virtual cell approaches across drug discovery and ageing research pipelines. “We believe that this work opens the door to more widespread use of virtual cells and reinforces our confidence in the virtual cell models that are helping to drive our target identification programme for cell rejuvenation,” Miller said.
The study’s results underline the importance of metric selection in benchmarking AI biological models. Shift Bioscience’s findings suggest that when metrics are properly calibrated, AI virtual cell systems can reliably detect biologically relevant signals, offering a scalable route to accelerate the identification of therapeutic targets in complex cellular systems.




