PacBio-led team publishes ‘Platinum Pedigree’ benchmark to boost AI accuracy in hard-to-sequence genome regions

A newly published benchmark dataset is helping push variant detection deeper into the human genome’s most complex regions, improving the performance of AI-based tools and reshaping standards for both clinical genomics and population-scale research.

The study, published in Nature Methods on August 4, 2025, introduces the Platinum Pedigree benchmark, the most comprehensive family-based variant dataset ever released. Built using long-read sequencing and inheritance-based validation across a 28-member multigenerational family, the resource characterises more than 37 Mb of variation, including structural variants and difficult-to-map regions previously excluded from many truth sets.

Developed by scientists at PacBio with collaborators from the University of Washington, the University of Utah, and other institutions, the benchmark significantly enhances the ability to detect genomic variation beyond standard reference regions. The dataset also adds more than 200 million bases to benchmarking coverage, extending into tandem repeats, segmental duplications, and low-complexity regions.

To demonstrate its impact, the team retrained Google DeepVariant, a widely used AI-based variant caller, using the new benchmark data. The result: up to a 34% reduction in false-positive variant calls genome-wide, with even greater gains in the most complex regions.

“This benchmark doesn’t just include simple variants in easy-to-sequence regions—it captures the full spectrum of variation, including areas previously considered too complex to benchmark reliably,” said Zev Kronenberg, lead author and senior manager at PacBio.

The dataset offers:

  • The first pedigree-validated tandem repeat and structural variant truth sets

  • Benchmark regions expanded to 2.77 Gb, including difficult genome areas

  • A framework for evaluating and improving AI/ML-driven pipelines in genomics

  • Immediate relevance to clinical sequencing workflows and population-scale studies

Michael Eberle, vice president of computational biology at PacBio and senior author on the study, said: “This benchmark pushes accuracy where it matters most. It enables better evaluation of variant calling pipelines and accelerates the development of methods that finally reach the full genome, including regions important for human health.”

All data, code, and analysis pipelines have been made publicly available by the Platinum Pedigree Consortium.

The benchmark is already being adopted by researchers developing AI-powered genomics tools, validating clinical sequencing protocols, and building out the next generation of whole-genome reference sets, such as T2T-CHM13.
