
LLM Evaluation for Reproducible AI Reporting

Benchmarking LLM-generated AI method annotations against annotations curated by human experts to evaluate reproducibility and reporting quality in life science AI publications.

Tags: LLMs, AI Evaluation, Reproducibility, Benchmarking

Overview

DOME Copilot is a large language model (LLM)-based system designed to automatically generate structured AI method reports from scientific manuscripts following the DOME recommendations for machine learning reporting in biology.
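
For illustration, a structured DOME report can be thought of as one record with a field per DOME section (Data, Optimization, Model, Evaluation). The sketch below is a hypothetical representation with illustrative field contents, not the actual DOME Copilot schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class DomeReport:
    """Hypothetical structured DOME report; the fields mirror the four
    DOME sections but are not the actual DOME Copilot schema."""
    data: str          # e.g. dataset provenance, splits, redundancy handling
    optimization: str  # e.g. algorithm, hyperparameters, overfitting controls
    model: str         # e.g. interpretability, availability, execution details
    evaluation: str    # e.g. metrics, baselines, confidence measures

# Illustrative values only
report = DomeReport(
    data="public benchmark corpus with an 80/10/10 split",
    optimization="gradient-boosted trees, grid-searched hyperparameters",
    model="weights and code publicly archived",
    evaluation="AUC-ROC with bootstrap confidence intervals",
)
print(asdict(report))
```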

My work focused on evaluating and benchmarking LLM-generated outputs against expert human annotations to assess the quality, consistency, and reliability of automatically extracted AI methodology metadata.

The project involved:

  • benchmarking generated annotations against manually curated DOME annotations
  • evaluating semantic similarity between LLM outputs and human expert annotations (a similarity sketch follows this list)
  • analyzing reporting quality across AI method disclosure categories (see the coverage sketch below)
  • assessing the scalability of automated literature annotation pipelines
  • studying hallucination avoidance and information extraction performance (see the grounding-check sketch below)
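
For the semantic-similarity comparison, a minimal sketch of one common approach: embed both annotation texts and score their cosine similarity. The embedding model below is an assumption for illustration, not necessarily what the project used:

```python
from sentence_transformers import SentenceTransformer, util

# The checkpoint is an assumption for this sketch, not the project's model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(llm_annotation: str, human_annotation: str) -> float:
    """Cosine similarity between embeddings of the two annotation texts."""
    embeddings = model.encode([llm_annotation, human_annotation])
    return util.cos_sim(embeddings[0], embeddings[1]).item()

score = semantic_similarity(
    "Model trained with 5-fold cross-validation on curated protein data.",
    "Training used curated protein datasets and 5-fold cross-validation.",
)
print(f"similarity: {score:.3f}")
```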
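
Reporting quality across disclosure categories can be summarized as simple coverage counts: how often each category receives a non-empty answer in the generated reports versus the curated ones. A minimal sketch, assuming each report is a dict keyed by category name:

```python
from collections import Counter

def category_coverage(reports: list[dict[str, str]]) -> Counter:
    """Count how many reports give a non-empty answer for each category."""
    coverage = Counter()
    for report in reports:
        for category, text in report.items():
            if text and text.strip():
                coverage[category] += 1
    return coverage

# Illustrative inputs only
llm_reports = [
    {"data": "benchmark corpus", "optimization": "", "model": "CNN", "evaluation": "F1"},
    {"data": "public dataset", "optimization": "Adam", "model": "", "evaluation": "AUC"},
]
print(category_coverage(llm_reports))
# Counter({'data': 2, 'evaluation': 2, 'optimization': 1, 'model': 1})
```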
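
Hallucination avoidance can be probed with a simple grounding check: does each extracted statement actually appear in, or closely match, the source manuscript? The fuzzy-matching heuristic below only illustrates the idea and is not the project's method:

```python
from difflib import SequenceMatcher

def is_grounded(extracted: str, manuscript: str, threshold: float = 0.8) -> bool:
    """Treat an extracted annotation as grounded if its longest matching
    substring in the manuscript covers most of the annotation."""
    matcher = SequenceMatcher(None, extracted.lower(), manuscript.lower())
    match = matcher.find_longest_match(0, len(extracted), 0, len(manuscript))
    return match.size / max(len(extracted), 1) >= threshold

manuscript = "We trained a random forest on 10,000 curated protein sequences."
print(is_grounded("random forest on 10,000 curated protein sequences", manuscript))  # True
print(is_grounded("a transformer fine-tuned on one million images", manuscript))     # False
```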

Code

The DOME Copilot data analysis code can be found here.

A preprint is available on bioRxiv.

Funding

This work was supported by the ELIXIR, AI4EOSC, EVERSE, and ELIXIR STEERS initiatives, which promote trustworthy and reproducible artificial intelligence in the life sciences.