
Dataset Overview


Introduction

This section summarizes the datasets included in the Chemical Data Extraction Benchmark (ChemX) — a curated collection designed to support training and evaluation of models for multimodal information extraction from chemistry publications.

The datasets span a wide range of chemical subdomains and include annotations from text, tables, and visual content such as plots and chemical diagrams.

Data Quality

All datasets were manually annotated and rigorously cross-validated by chemistry experts to ensure high accuracy and consistency.


Data Types and Sources

The datasets were constructed from full-text PDFs of peer-reviewed articles, combining automated extraction with manual correction. Each dataset may include:

  • Experimental values (e.g., MIC, logP, lgK, catalytic constants)
  • Chemical identifiers and structures (e.g., SMILES, compound names)
  • Tabular and visual content (figures, plots, spectra, diagrams)
  • Source metadata (DOI, title, authors, journal, year, accessibility)

Data originated from:

  • Main article bodies
  • Supplementary materials
  • Structured tables and unstructured figures
  • OCR and model-assisted extraction workflows

Multimodal Composition

ChemX datasets contain text, table, and figure-based data, enabling the evaluation of models that process diverse input formats.


Dataset Structure and Organization

Each dataset is provided in a structured tabular format (CSV or Parquet) and is accompanied by:

  • Full provenance metadata
  • A detailed schema describing fields and units
  • Validation outputs (where applicable)
  • A Croissant metadata file for interoperability via Hugging Face

Supporting Documentation

All datasets include source article lists, annotation guidelines, and usage notes.


Summary Table of Datasets

Dataset Name             Domain            Records   Expert Validation   Link
Cytotoxicity             Nanomaterials     5476                          Learn more
SelTox                   Nanomaterials     3244                          Learn more
Synergy                  Nanomaterials     3226                          Learn more
Nanozymes                Nanomaterials     1135                          Learn more
Magnetic nanomaterials   Nanomaterials     2578                          Learn more
Benzimidazoles           Small molecules   1721                          Learn more
Oxazolidinones           Small molecules   2923                          Learn more
Chelate Complexes        Small molecules   907                           Learn more
Eye Drops                Small molecules   163                           Learn more
Co-crystals              Small molecules   70                            Learn more
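
As a quick arithmetic check on the table above, the record counts can be totalled per domain. The sketch below only transcribes the counts listed in the table; the names DATASETS and records_by_domain are illustrative and not part of ChemX.

```python
from collections import defaultdict

# (dataset name, domain, records), transcribed from the summary table
DATASETS = [
    ("Cytotoxicity", "Nanomaterials", 5476),
    ("SelTox", "Nanomaterials", 3244),
    ("Synergy", "Nanomaterials", 3226),
    ("Nanozymes", "Nanomaterials", 1135),
    ("Magnetic nanomaterials", "Nanomaterials", 2578),
    ("Benzimidazoles", "Small molecules", 1721),
    ("Oxazolidinones", "Small molecules", 2923),
    ("Chelate Complexes", "Small molecules", 907),
    ("Eye Drops", "Small molecules", 163),
    ("Co-crystals", "Small molecules", 70),
]


def records_by_domain(rows):
    """Sum the record counts for each domain."""
    totals = defaultdict(int)
    for _name, domain, records in rows:
        totals[domain] += records
    return dict(totals)


totals = records_by_domain(DATASETS)
for domain, count in totals.items():
    print(f"{domain}: {count}")
print("Total:", sum(totals.values()))
```

With the counts as listed, this gives 15659 records for nanomaterials and 5784 for small molecules, 21443 in total.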

How to Use the Datasets

The ChemX datasets support a variety of research and development workflows:

  • Training and evaluating information extraction systems (e.g., LLMs, OCR, image-text models)
  • Developing QSAR models and exploring structure–property relationships
  • Benchmarking multimodal AI for chemistry-focused applications
  • Supporting tasks in materials design, drug discovery, and toxicity prediction
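
To make the first use case concrete, the sketch below scores a set of extracted records against expert annotations using exact-match precision, recall and F1. It is a minimal illustration, not the official ChemX evaluation protocol: the field names (compound, property, value), the normalisation rule and the toy records are all assumptions made for the example.

```python
def extraction_scores(predicted, reference, key_fields):
    """Return (precision, recall, f1) for exact-match record extraction.

    Records are dicts; two records match when all key_fields agree after
    trimming whitespace and lower-casing (an assumed normalisation).
    """
    def to_keys(records):
        return {
            tuple(str(r.get(f, "")).strip().lower() for f in key_fields)
            for r in records
        }

    pred, ref = to_keys(predicted), to_keys(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data with hypothetical field names; not taken from any ChemX dataset.
reference = [
    {"compound": "A", "property": "logP", "value": "1.2"},
    {"compound": "B", "property": "logP", "value": "0.7"},
    {"compound": "C", "property": "MIC", "value": "4"},
]
predicted = [
    {"compound": "a", "property": "logP", "value": "1.2"},  # matches after normalisation
    {"compound": "B", "property": "logP", "value": "0.9"},  # wrong value, no match
]

precision, recall, f1 = extraction_scores(
    predicted, reference, key_fields=("compound", "property", "value")
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is deliberately strict; numeric fields would usually need a tolerance or unit normalisation before comparison.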

Benchmark Usage

These datasets are already being used to evaluate state-of-the-art models on real-world chemical extraction tasks.


Access to the Datasets

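The datasets are available on Hugging Face under the ai-chem organization; for example, the Nanozymes dataset is at https://huggingface.co/datasets/ai-chem/Nanozymes.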

Croissant Files

Hugging Face releases include Croissant metadata files for structured dataset interoperability and schema validation.
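
One way to inspect a Croissant file is to read its JSON-LD and list the fields declared in each record set. The sketch below assumes that Hugging Face serves Croissant metadata at /api/datasets/<repo_id>/croissant and that the file follows the usual recordSet / field layout; the helper names and the sample field names are hypothetical, chosen only for illustration.

```python
import json
from urllib.request import urlopen


def croissant_url(repo_id):
    """Build the (assumed) Hugging Face API URL for a dataset's Croissant JSON-LD."""
    return f"https://huggingface.co/api/datasets/{repo_id}/croissant"


def fetch_croissant(repo_id, timeout=30):
    """Download and parse the Croissant metadata (requires network access)."""
    with urlopen(croissant_url(repo_id), timeout=timeout) as resp:
        return json.load(resp)


def list_fields(metadata):
    """Map each record set name to a list of (field name, data type) pairs."""
    summary = {}
    for record_set in metadata.get("recordSet", []):
        fields = [
            (field.get("name"), field.get("dataType"))
            for field in record_set.get("field", [])
        ]
        summary[record_set.get("name")] = fields
    return summary


# Offline sample with hypothetical fields, shaped like a Croissant document.
sample = {
    "recordSet": [
        {
            "name": "default",
            "field": [
                {"name": "default/doi", "dataType": "sc:Text"},
                {"name": "default/year", "dataType": "sc:Integer"},
            ],
        }
    ]
}
print(list_fields(sample))
```

For a live dataset, list_fields(fetch_croissant("ai-chem/Nanozymes")) would produce the same kind of summary, provided the endpoint assumption holds.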


Example of Loading a Dataset in Python

import pandas as pd

# Direct link to Parquet file
parquet_url = "https://huggingface.co/datasets/ai-chem/Nanozymes/resolve/main/data/train-00000-of-00001.parquet"

# Load with pandas using the pyarrow engine (the pyarrow package must be installed)
df = pd.read_parquet(parquet_url, engine="pyarrow")

# Preview the first five rows
print(df.head())
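
The same pattern can be wrapped in a small helper for the other datasets. This is a sketch under two assumptions not confirmed here: that each dataset lives in a repository named after it under ai-chem, and that it uses the same single-shard train file layout as Nanozymes. The function names chemx_parquet_url and load_chemx are hypothetical.

```python
def chemx_parquet_url(dataset, org="ai-chem", split="train"):
    """Build a direct Parquet URL, assuming the single-shard layout used by Nanozymes."""
    return (
        f"https://huggingface.co/datasets/{org}/{dataset}"
        f"/resolve/main/data/{split}-00000-of-00001.parquet"
    )


def load_chemx(dataset):
    """Load one dataset into a DataFrame (requires pandas, pyarrow and network access)."""
    import pandas as pd  # imported lazily so URL building works without pandas

    return pd.read_parquet(chemx_parquet_url(dataset), engine="pyarrow")


print(chemx_parquet_url("Nanozymes"))
```

If a repository uses a different name or shards its data across several files, the URL will not resolve, so it is worth checking the dataset page before relying on this helper.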

Summary

  • ChemX provides a multimodal benchmark covering key chemical subfields
  • Datasets are expert-validated and rich in metadata
  • The benchmark is designed for reproducible, scalable training and evaluation of AI models in chemistry
  • All datasets are fully documented and accessible through open platforms