Loading and Working with ChemX Datasets

This section walks through basic usage of one of the ChemX datasets hosted on Hugging Face. The code is reproducible in Google Colab or any Jupyter-based environment.

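The demo is described for the cytox_NeurIPS_updated_data dataset, which contains cytotoxicity data for various nanomaterials. The code itself uses the ai-chem/Nanozymes identifier; the same steps apply to any ChemX dataset once the identifier is swapped.
This example includes: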

Loading the dataset into a pandas DataFrame using the datasets library
Accessing the underlying Parquet file directly
Programmatically downloading and parsing the Croissant metadata description

1. Installing Dependencies

!pip install pandas datasets requests pyarrow

2. Loading the Dataset via Hugging Face Datasets

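To access the pre-processed dataset quickly, we use the datasets library. Called without a split argument, load_dataset fetches the dataset files from the Hub and returns a DatasetDict keyed by split name; calling to_pandas() on a split converts it to a pandas DataFrame.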

from datasets import load_dataset

# Unique dataset identifier on Hugging Face
dataset_id = "ai-chem/Nanozymes"

dataset = load_dataset(dataset_id)
df = dataset["train"].to_pandas()

df.head()

📌 Datasets hosted on the Hugging Face Hub also expose a Croissant metadata description, which step 4 downloads.

3. Alternative: Load the Raw Parquet File

If you prefer direct control over file access (for example, to load one specific chunk), you can read the raw .parquet file directly with pandas:

import pandas as pd

# Direct link to Parquet file
parquet_url = "https://huggingface.co/datasets/ai-chem/Nanozymes/resolve/main/data/train-00000-of-00001.parquet"

# Loading with pandas and pyarrow
df = pd.read_parquet(parquet_url, engine="pyarrow")

df.head()

📌 Ensure that the link points to the raw file using /resolve/ instead of /blob/.
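
A /blob/ URL opens the Hub's file viewer page rather than the file itself, so pandas cannot parse it as Parquet. As a minimal sketch, a link copied from the browser can be converted with a plain string replacement. The helper name to_raw_url is hypothetical and not part of any library, and the sketch assumes the URL contains at most one /blob/ segment that needs swapping.

```python
def to_raw_url(url: str) -> str:
    """Swap the first '/blob/' path segment for '/resolve/'.

    Hypothetical helper for illustration; URLs that already use
    '/resolve/' are returned unchanged.
    """
    return url.replace("/blob/", "/resolve/", 1)


# Link as it would appear in the browser's address bar (file viewer page)
page_url = (
    "https://huggingface.co/datasets/ai-chem/Nanozymes/"
    "blob/main/data/train-00000-of-00001.parquet"
)

raw_url = to_raw_url(page_url)
print(raw_url)
```

The resulting raw_url can then be passed to pd.read_parquet as in the step above.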

4. Download and View Croissant Metadata

Each dataset in ChemX includes a Croissant file: a machine-readable description of the dataset's schema and metadata in JSON-LD format.
It is used for structural validation, typing, and metadata inspection.

import requests

# Direct link to Croissant JSON file
url = "https://huggingface.co/api/datasets/ai-chem/Nanozymes/croissant"

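# Downloading (fail early on HTTP errors instead of saving an error page)
response = requests.get(url, timeout=30)
response.raise_for_status()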

# Saving locally
with open("dataset.croissant.json", "wb") as f:
    f.write(response.content)
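
Because the response body is written to disk as-is, an unexpected payload (for example an HTML error page) would only surface later, when the file is parsed. The sketch below shows one possible guard: parse the bytes as JSON first and write them only if that succeeds. The function name save_json_payload and the sample payloads are hypothetical stand-ins; in the workflow above, response.content would be passed in place of the literal bytes.

```python
import json
import tempfile
from pathlib import Path


def save_json_payload(content: bytes, path: str) -> dict:
    """Write `content` to `path` only if it parses as a JSON object.

    Hypothetical helper for illustration. Raises ValueError for payloads
    that are not valid UTF-8 JSON or whose top level is not an object.
    """
    parsed = json.loads(content.decode("utf-8"))
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    Path(path).write_bytes(content)
    return parsed


# Demonstration with made-up payloads in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "dataset.croissant.json"
    parsed = save_json_payload(b'{"@type": "sc:Dataset", "name": "demo"}', str(target))
    saved_ok = target.exists()

    try:
        save_json_payload(b"<html>Not Found</html>", str(Path(tmp) / "bad.json"))
        rejected = False
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        rejected = True

print(parsed["name"], saved_ok, rejected)
```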

5. Load and Inspect Metadata Structure

Open the Croissant file and display its contents:

import json

with open("dataset.croissant.json", "r", encoding="utf-8") as f:
    croissant_data = json.load(f)

# View in readable form
print(json.dumps(croissant_data, indent=2))

Optional: You can explore the recordSet, field, and dataType keys to understand the schema.
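
One way to explore these keys is to flatten them into a list of (record set, field, data type) rows. The following is a minimal sketch, not a Croissant validator: it assumes recordSet and field are stored as lists (or single objects) under those key names, and that dataType is a string or a list of strings. The function name summarize_fields and the sample document are hypothetical; the sample is hand-written and does not reproduce the real Nanozymes metadata.

```python
from typing import Any


def as_list(value: Any) -> list:
    """Normalize a JSON-LD value that may be a single object or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize_fields(croissant: dict) -> list:
    """Return (record set name, field name, data type) triples."""
    rows = []
    for record_set in as_list(croissant.get("recordSet")):
        rs_name = record_set.get("name") or record_set.get("@id", "?")
        for field in as_list(record_set.get("field")):
            field_name = field.get("name") or field.get("@id", "?")
            data_type = ", ".join(as_list(field.get("dataType"))) or "?"
            rows.append((rs_name, field_name, data_type))
    return rows


# Hand-written stand-in for a parsed Croissant document (made-up fields)
sample = {
    "@type": "sc:Dataset",
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "default",
            "field": [
                {"@type": "cr:Field", "name": "default/formula", "dataType": "sc:Text"},
                {"@type": "cr:Field", "name": "default/activity", "dataType": "sc:Float"},
            ],
        }
    ],
}

field_summary = summarize_fields(sample)
for rs_name, field_name, data_type in field_summary:
    print(f"{rs_name}: {field_name} ({data_type})")
```

Passing croissant_data from the step above in place of sample should produce the same kind of listing for the downloaded file, provided it follows the layout assumed here.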

Summary

This basic workflow demonstrates:

Two approaches for accessing tabular data (via the datasets library and by reading the raw Parquet file)
Integration with open metadata and reproducibility standards (Croissant)
Compatibility between Hugging Face datasets, pandas, and JSON-based ecosystems