Dataset Overview¶
Introduction¶
This section summarizes the datasets included in the Chemical Data Extraction Benchmark (ChemX) — a curated collection designed to support training and evaluation of models for multimodal information extraction from chemistry publications.
The datasets span a wide range of chemical subdomains and include annotations from text, tables, and visual content such as plots and chemical diagrams.
Data Quality
All datasets were manually annotated and rigorously cross-validated by chemistry experts to ensure high accuracy and consistency.
Data Types and Sources¶
The datasets were constructed from full-text PDFs of peer-reviewed articles, combining both automated extraction and manual correction. Each dataset may include:
- Experimental values (e.g., MIC, logP, lgK, catalytic constants)
- Chemical identifiers and structures (e.g., SMILES, compound names)
- Tabular and visual content (figures, plots, spectra, diagrams)
- Source metadata (DOI, title, authors, journal, year, accessibility)
Data originated from:
- Main article bodies
- Supplementary materials
- Structured tables and unstructured figures
- OCR and model-assisted extraction workflows
Multimodal Composition
ChemX datasets contain text, table, and figure-based data, enabling the evaluation of models that process diverse input formats.
Dataset Structure and Organization¶
Each dataset is provided in a structured tabular format (CSV or Parquet), and is accompanied by:
- Full provenance metadata
- A detailed schema describing fields and units
- Validation outputs (where applicable)
- A Croissant metadata file for interoperability via Hugging Face
Supporting Documentation
All datasets include source article lists, annotation guidelines, and usage notes.
Summary Table of Datasets¶
Dataset Name | Domain | Records | Expert Validation | Link |
---|---|---|---|---|
Cytotoxicity | Nanomaterials | 5476 | ✅ | Learn more |
SelTox | Nanomaterials | 3244 | ✅ | Learn more |
Synergy | Nanomaterials | 3226 | ✅ | Learn more |
Nanozymes | Nanomaterials | 1135 | ✅ | Learn more |
Magnetic nanomaterials | Nanomaterials | 2578 | ✅ | Learn more |
Benzimidazoles | Small molecules | 1721 | ✅ | Learn more |
Oxazolidinones | Small molecules | 2923 | ✅ | Learn more |
Chelate Complexes | Small molecules | 907 | ✅ | Learn more |
Eye Drops | Small molecules | 163 | ✅ | Learn more |
Co-crystals | Small molecules | 70 | ✅ | Learn more |
How to Use the Datasets¶
The ChemX datasets support a variety of research and development workflows:
- Training and evaluating information extraction systems (e.g., LLMs, OCR, image-text models)
- Developing QSAR models and exploring structure–property relationships
- Benchmarking multimodal AI for chemistry-focused applications
- Supporting tasks in materials design, drug discovery, and toxicity prediction
Benchmark Usage
These datasets are already being used to evaluate state-of-the-art models on real-world chemical extraction tasks.
Access to the Datasets¶
You can access the datasets via:
- Hugging Face Datasets Hub
- The GitHub repository
- Direct downloads (CSV/Parquet)
- An upcoming PyPI Python package for programmatic access
Croissant Files
Hugging Face releases include Croissant metadata files for structured dataset interoperability and schema validation.
Example of Loading a Dataset in Python¶
import pandas as pd
# Direct link to Parquet file
parquet_url = "https://huggingface.co/datasets/ai-chem/Nanozymes/resolve/main/data/train-00000-of-00001.parquet"
# Loading with pandas and pyarrow
df = pd.read_parquet(parquet_url, engine="pyarrow")
df.head()
Summary¶
- ChemX provides a multimodal benchmark covering key chemical subfields
- Datasets are expert-validated and rich in metadata
- Designed for reproducible, scalable training and evaluation of AI models in chemistry
- Fully documented and accessible through open platforms