
Synthetic Data for Variant Calling Benchmarking

Framework for generating synthetic genomics datasets for benchmarking tumor-only somatic variant callers.


Overview

Synth4Bench is a framework for generating synthetic genomics datasets designed to support the systematic benchmarking of tumor-only somatic variant calling algorithms.

The project addresses a key challenge in cancer genomics: evaluating variant callers when high-quality ground truth data are limited or unavailable. By generating synthetic datasets with known variants, Synth4Bench enables controlled experiments where sequencing depth, read length, allele frequency and variant characteristics can be adjusted and evaluated.

My work focused on the design and implementation of the synthetic data generation workflow, the benchmarking strategy and the downstream analysis used to compare variant calling tools against known ground truth.
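As an illustration of what comparing a caller against known ground truth involves, here is a minimal Python sketch. It is not the Synth4Bench implementation: the `Variant` type, the `score_calls` function and the toy coordinates are hypothetical, and the exact-match rule on chromosome, position, reference allele and alternate allele is an assumption made for simplicity.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a Synth4Bench type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def score_calls(truth: Set[Variant], calls: Set[Variant]) -> Dict[str, float]:
    """Score a call set against a ground-truth set using exact matching."""
    tp = len(truth & calls)   # variants present in both sets
    fp = len(calls - truth)   # called, but absent from the ground truth
    fn = len(truth - calls)   # in the ground truth, but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data: coordinates and alleles are made up for illustration only.
truth = {
    Variant("chr1", 100, "C", "T"),
    Variant("chr1", 200, "G", "A"),
    Variant("chr1", 300, "A", "G"),
}
calls = {
    Variant("chr1", 100, "C", "T"),
    Variant("chr1", 200, "G", "A"),
    Variant("chr1", 400, "T", "C"),
}

metrics = score_calls(truth, calls)
print(metrics)
```

Because the synthetic variants are known in advance, the same scoring can be repeated for each caller and each simulated condition (for example, sequencing depth or read length), so results stay directly comparable.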

Code and Data

The code is available on GitHub, and all data are available on Zenodo.

The manuscript is under review, but a preprint is available on bioRxiv.

Funding

This work has received funding from the SYNTHIA project and was developed as part of my PhD research.