LLM Evaluation for Reproducible AI Reporting

Benchmarking LLM-generated AI method annotations against expert human-curated annotations to evaluate reproducibility and reporting quality in life science AI publications.

LLMs AI Evaluation Reproducibility Benchmarking

Overview

DOME Copilot is a large language model (LLM)-based system designed to automatically generate structured AI method reports from scientific manuscripts following the DOME recommendations for machine learning reporting in biology.

My work focused on the evaluation and benchmarking of LLM-generated outputs against expert human annotations in order to assess the quality, consistency and reliability of automatically extracted AI methodology metadata.

The project involved:

benchmarking generated annotations against manually curated DOME annotations
evaluating semantic similarity between LLM outputs and human expert annotations
analyzing reporting quality across AI method disclosure categories
assessing scalability of automated literature annotation pipelines
studying hallucination avoidance and information extraction performance
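
As a rough illustration of the per-field comparison between LLM outputs and expert annotations listed above, the sketch below scores each field of a generated annotation against its curated counterpart. It is not the project's actual pipeline: it swaps in a simple bag-of-words cosine similarity as a lexical stand-in for a semantic similarity measure, and the field names, example texts, and function names (tokenize, cosine_similarity, score_annotations) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two strings.

    Assumption: a lexical proxy only; a real evaluation would likely use
    sentence embeddings or another semantic measure.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def score_annotations(generated, curated):
    """Score every field of the curated annotation.

    A field missing from the generated annotation is treated as empty
    text and therefore scores 0.0.
    """
    return {
        field: cosine_similarity(generated.get(field, ""), text)
        for field, text in curated.items()
    }


# Hypothetical example records, not taken from the project data.
curated = {
    "data": "Training set of 5000 protein sequences with an independent test set",
    "model": "Random forest classifier",
    "evaluation": "Five-fold cross-validation reporting accuracy and AUC",
}
generated = {
    "data": "5000 protein sequences used for training with an independent test set",
    "model": "Random forest classifier",
    # "evaluation" is deliberately missing to show how gaps are scored.
}

scores = score_annotations(generated, curated)
for field, score in scores.items():
    print(f"{field}: {score:.2f}")
```

Scoring per field rather than per document keeps omissions visible: a field the model left empty scores zero instead of being averaged away by well-matched fields.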

Code

DOME Copilot Data Analysis can be found here.

Related Publication

Preprint available on bioRxiv.

Funding

This work was funded by the ELIXIR, AI4EOSC, EVERSE and ELIXIR STEERS initiatives, which support trustworthy and reproducible artificial intelligence in the life sciences.
