Extraction Prompts¶
Overview¶
This page contains the standardized extraction prompts used to build the ChemX benchmark datasets. Each prompt is tailored to extract a specific type of chemical or biomedical information from scientific literature, ensuring semantic consistency, structured formatting, and high-quality validation.
All prompts follow the same structure:
- System Message – Defines the assistant’s role and domain knowledge.
- Extraction Protocol – Provides detailed instructions on what to extract, how to handle edge cases, and how to deal with missing or ambiguous information.
- Required Fields – Lists all expected fields with data types and example values.
- Extraction Rules – Custom constraints and logic relevant to each domain.
- Output Format – JSON array format illustrating the expected data structure.
These prompts are used with large language models (LLMs) for automated extraction of structured data from scientific documents. All extracted data is further subjected to manual validation and expert review.
Dataset Prompts by Category¶
Click on any dataset name to jump to its corresponding prompt.
Nanomaterials:¶
Small Molecules:¶
Metal Complexes:¶
Looking for the actual datasets?¶
See the full descriptions on the Datasets Overview page.
Benzimidazole Antibiotics¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in chemistry of small molecules. In particular, your area is antibiotics and their properties."
}
Extraction Protocol¶
Your task is to extract every mention of MIC or pMIC measurements against Staphylococcus aureus and Escherichia coli bacteria for ALL benzimidazole antibiotics from a scientific article and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
compound_id |
string | ID of a molecule within the article, as cited in the text | "5a" , "Compound 3" |
smiles |
string | Full SMILES representation of a benzimidazole antibiotic | |
target_type |
string | Type of measurement, either "MIC" or "pMIC" , exactly as stated |
|
target_relation |
string | One of "=" , "<" , or ">" . If no relation symbol is shown, use "=" |
|
target_value |
number | The numeric value of MIC/pMIC (without quotes) | |
target_units |
string | MIC units | "μg/mL" , "mg/L" |
bacteria |
string | The organism against which MIC/pMIC was measured, named exactly as in the text |
Extraction Rules¶
-
Extract each MIC/pMIC mention as a separate object. If multiple MIC/pMIC are reported for the same compound against different bacteria, list them as separate entries.
-
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If a range is given (e.g., "2–8 μg/mL"), leave it as a range.
-
If a molecule is fully depicted in a figure, write it as a SMILES string. If a molecule is depicted as a scaffold and residues separately in different places of an article, connect them by compound ID into one molecule and write it as a single SMILES string.
-
Extract only measurements with Staphylococcus aureus and Escherichia coli. Record full names, abbreviations, or any related taxonomic identifiers of bacteria.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all MIC or pMIC measurements of benzimidazole antibiotics present in the article.
Output Format¶
[
{
"compound_id": "11h",
"smiles": "O=C(OCC)C1=C(N(C(=O)N(C1C2=C(C=CS2)C)[H])[H])C[N]3C=NC4=C3C=C(C=C4)[N+](=O)[O-]",
"target_type": "MIC",
"target_relation": "<",
"target_value": 1,
"target_units": "mmol/l",
"bacteria": "methicillin-susceptible S. aureus"
},
{
"compound_id": "5a",
"smiles": "CCN1C=C(C(=O)C2=CC(=C(C=C21)N3CCN(CC3)C4=NC=CC(=N4)N)F)C(=O)O",
"target_type": "pMIC",
"target_relation": "<",
"target_value": 2,
"target_units": "μg/mL",
"bacteria": "Escherichia coli"
}
]
Oxazolidinone Antibiotics¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in chemistry of small molecules. In particular, your area is antibiotics and their properties."
}
Extraction Protocol¶
Your task is to extract every mention of MIC or pMIC values for oxazolidinone antibiotics from a scientific article and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
compound_id |
string | ID of a molecule within the article, as cited in the text | "5a" , "Compound 3" |
smiles |
string | Full SMILES representation of an oxazolidinone antibiotic | |
target_type |
string | Type of measurement, either "MIC" or "pMIC" , exactly as stated |
|
target_relation |
string | One of "=" , "<" , or ">" . If no relation symbol is shown, use "=" |
|
target_value |
number | The numeric value of MIC/pMIC (without quotes) | |
target_units |
string | MIC units | "μg/mL" , "mg/L" |
bacteria |
string | The organism against which MIC/pMIC was measured, named exactly as in the text. Record full names, abbreviations, or any related taxonomic identifiers of bacteria. |
Extraction Rules¶
-
Extract each MIC or pMIC mention as a separate object.
-
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If a range is given (e.g., "2–8 μg/mL"), leave it as a range.
-
If a molecule is fully depicted in a figure, write it as a SMILES string. If a molecule is depicted as a scaffold and residues separately in different places of an article, connect them by compound ID into one molecule and write it as a single SMILES string.
-
If multiple measurement types appear for the same compound and bacterium (e.g., MIC₅₀, MIC₉₀), extract each separately.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all MIC or pMIC measurements of oxazolidinone antibiotics present in the article.
Output Format¶
[
{
"compound_id": "12b",
"smiles": "CC1=CC=C(C=C1)C(=O)Nc2ccc(cc2)C(=O)N3CCCCC3=O",
"target_type": "MIC",
"target_relation": "<",
"target_value": 1,
"target_units": "mmol/l",
"bacteria": "methicillin-susceptible S. aureus"
},
{
"compound_id": "5a",
"smiles": "CC1=CC=CC=C1N2C=NC3=CC=CC=C23",
"target_type": "MIC",
"target_relation": "=",
"target_value": 2,
"target_units": "μg/mL",
"bacteria": "Escherichia coli"
}
]
Cocrystals¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in the chemistry of cocrystals and their properties. Your area of expertise includes analyzing cocrystals, their components, and photostability changes."
}
Extraction Protocol¶
Your task is to extract every mention of photostability for co-crystals from a scientific article, and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
name_cocrystal |
string | Name of cocrystal, as cited in the text | "CAR-HCT" , "DMZ-SAC" |
ratio_cocrystal |
string | Molar ratio of the cocrystal components | "2:1" , "0.5:1" |
name_drug |
string | Name of the drug in the cocrystal as cited in the text | "Carvedilol" , "Epalrestat" |
SMILES_drug |
string | Full SMILES representation of drug | |
name_coformer |
string | Name of the coformer in the cocrystal as cited in the text | "Saccharin" , "Oxalic acid" |
SMILES_coformer |
string | Full SMILES representation of coformer | |
photostability_change |
string | One of "decrease" , "does not change" , or "increase" . Trend of photostability for both the cocrystal and the drug |
Extraction Rules¶
-
Extract each photostability mention as a separate object.
-
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If multiple polymorphic forms (e.g., CBZ-SAC Form I, CBZ-SAC Form II) appear for the same drug and coformer in the same ratio, extract each separately.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all mentions of photostability for co-crystals present in the article.
Output Format¶
[
{
"name_cocrystal": "CAR-HCT",
"ratio_cocrystal": "2:1",
"name_drug": "Carvedilol",
"SMILES_drug": "C1=CC(=C(C=C1O)O)C=CC2=CC(=CC(=C2)O)O",
"name_coformer": "Saccharin",
"SMILES_coformer": "O=C(O)CC(O)C(=O)O",
"photostability_change": "decrease"
},
{
"name_cocrystal": "DMZ-SAC",
"ratio_cocrystal": "0.5:1",
"name_drug": "Epalrestat",
"SMILES_drug": "C1=CC(=C(C=C1O)O)C=CC2=CC(=CC(=C2)O)O",
"name_coformer": "Oxalic acid",
"SMILES_coformer": "C(=C/C(=O)O)\\C(=O)O",
"photostability_change": "does not change"
}
]
Complexes - Ga¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in the chemistry of organometallic complexes and their properties."
}
Extraction Protocol¶
Your task is to extract every mention of organometallic complexes and chelate ligands from scientific article, and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
compound_id |
string | ID of a complex within the article, as cited in the text | "L3" , "A31" |
compound_name |
string | Abbreviated or full name of the complex or ligand as cited in the text | "DOTA" , "tebroxime" |
SMILES |
string | Full SMILES representation of ligand environment or single ligand. — If a complete organometallic complex is shown, extract all ligand structures without mentioning the metal. — If only a chelate ligand is shown, extract only its structure. |
|
SMILES_type |
string | One of "ligand" or "environment" |
|
target_value |
number | The numeric value of logarithms of thermodynamic stability constants lgK or logK |
Note on SMILES format: - If a complete organometallic complex is shown, extract all ligand structures without mentioning the metal - For a chelate ligand without a complete organometallic complex, extract only that ligand's structure
Extraction Rules¶
-
Extract each mention of
target_value
(lgK or logK) as a separate object. -
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If a molecule is fully depicted in a figure, write it as a SMILES string. If a molecule is depicted as a scaffold and residues separately in different places of an article, connect them by compound ID or name into one molecule and write it a single SMILES string.
-
If multiple thermodynamic stability constants appear for the same complex or ligand extract each separately.
-
Extract only structures that comply with these rules: 5.1. The complexes must contain Ga as the metal or the ligands must belong to complexes of that metal. 5.2. The complete molecular structure shall be given without errors in it or identifiers. 5.3. Compounds must contain more than one carbon (exclude CO, Me). 5.4. Compounds must not contain polymeric structures, attached biomolecules or carboranes, undefined radicals, undeciphered designations (e.g., amino acids) beyond the simplest abbreviations (i.e., Me, Et, Pr, Bu, Ph, Ac), names of radicals instead of their structure, or incomplete indication of the ligand structure (e.g., L = P, N). 5.5. Compounds must not be reaction intermediate or precursor.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all mentions of organometallic complexes and/or chelate ligands present in the article.
Output Format¶
[
{
"compound_id": "L3",
"compound_name": "DOTA",
"SMILES": "O=C(O)CN(CCN(CC(=O)O)CC(=O)O)CC(=O)O",
"SMILES_type": "ligand",
"target": 21.3
},
{
"compound_id": "A31",
"compound_name": "tebroxime",
"SMILES": "[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC",
"SMILES_type": "environment",
"target": 17.9
}
]
Complexes - Gd¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in the chemistry of organometallic complexes and their properties."
}
Extraction Protocol¶
Your task is to extract every mention of organometallic complexes and chelate ligands from scientific article, and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
compound_id |
string | ID of a complex within the article, as cited in the text | "L3" , "A31" |
compound_name |
string | Abbreviated or full name of the complex or ligand as cited in the text | "DOTA" , "tebroxime" |
SMILES |
string | Full SMILES representation of ligand environment or single ligand | See note below |
SMILES_type |
string | One of "ligand" or "environment" |
|
target_value |
number | The numeric value of logarithms of thermodynamic stability constants lgK or logK |
Note on SMILES format: - If a complete organometallic complex is shown, extract all ligand structures without mentioning the metal (e.g., "COc1cc(C=CC([O-])CC([O-])CC([O-])C=Cc2ccc(O)c(OC)c2)ccc1O. [C-]#[O+].[C-]#[O+].[C-]#[O+].[OH-]") - For a chelate ligand without a complete organometallic complex, extract only that ligand's structure (e.g., 'O=C(O)CN(CCN(CC(CC(=O)O)CC(=O)O)CCN(CC(=O)O)CC(=O)O')
Extraction Rules¶
-
Extract each mention of
target_value
(lgK or logK) as a separate object. -
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If a molecule is fully depicted in a figure, write it as a SMILES string. If a molecule is depicted as a scaffold and residues separately in different places of an article, connect them by compound ID or name into one molecule and write it a single SMILES string.
-
If multiple thermodynamic stability constants appear for the same complex or ligand extract each separately.
-
Extract only structures that comply with these rules: 5.1. The complexes must contain Gd as the metal or the ligands must belong to complexes of that metal. 5.2. The complete molecular structure shall be given without errors in it or identifiers. 5.3. Compounds must contain more than one carbon (exclude CO, Me). 5.4. Compounds must not contain polymeric structures, attached biomolecules or carboranes, undefined radicals, undeciphered designations (e.g., amino acids) beyond the simplest abbreviations (i.e., Me, Et, Pr, Bu, Ph, Ac), names of radicals instead of their structure, or incomplete indication of the ligand structure (e.g., L = P, N). 5.5. Compounds must not be reaction intermediate or precursor.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all mentions of organometallic complexes and/or chelate ligands present in the article.
Output Format¶
[
{
"compound_id": "L3",
"compound_name": "DOTA",
"SMILES": "O=C(O)CN(CCN(CC(=O)O)CC(=O)O)CC(=O)O",
"SMILES_type": "ligand",
"target": 21.3
},
{
"compound_id": "A31",
"compound_name": "tebroxime",
"SMILES": "[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC",
"SMILES_type": "environment",
"target": 17.9
}
]
Complexes - Tc¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in the chemistry of organometallic complexes and their properties."
}
Extraction Protocol¶
Your task is to extract every mention of organometallic complexes and chelate ligands from scientific article, and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
compound_id |
string | ID of a complex within the article, as cited in the text | "L3" , "A31" |
compound_name |
string | Abbreviated or full name of the complex or ligand as cited in the text | "DOTA" , "tebroxime" |
SMILES |
string | Full SMILES representation of ligand environment or single ligand | See note below |
SMILES_type |
string | One of "ligand" or "environment" |
|
target_value |
number | The numeric value of logarithms of thermodynamic stability constants lgK or logK |
Note on SMILES format: - If a complete organometallic complex is shown, extract all ligand structures without mentioning the metal (e.g., "COc1cc(C=CC([O-])CC([O-])CC([O-])C=Cc2ccc(O)c(OC)c2)ccc1O. [C-]#[O+].[C-]#[O+].[C-]#[O+].[OH-]") - For a chelate ligand without a complete organometallic complex, extract only that ligand's structure (e.g., 'O=C(O)CN(CCN(CC(CC(=O)O)CC(=O)O)CCN(CC(=O)O)CC(=O)O')
Extraction Rules¶
-
Extract each mention of
target_value
(lgK or logK) as a separate object. -
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If a molecule is fully depicted in a figure, write it as a SMILES string. If a molecule is depicted as a scaffold and residues separately in different places of an article, connect them by compound ID or name into one molecule and write it a single SMILES string.
-
If multiple thermodynamic stability constants appear for the same complex or ligand extract each separately.
-
Extract only structures that comply with these rules: 5.1. The complexes must contain Tc as the metal or the ligands must belong to complexes of that metal. 5.2. The complete molecular structure shall be given without errors in it or identifiers. 5.3. Compounds must contain more than one carbon (exclude CO, Me). 5.4. Compounds must not contain polymeric structures, attached biomolecules or carboranes, undefined radicals, undeciphered designations (e.g., amino acids) beyond the simplest abbreviations (i.e., Me, Et, Pr, Bu, Ph, Ac), names of radicals instead of their structure, or incomplete indication of the ligand structure (e.g., L = P, N). 5.5. Compounds must not be reaction intermediate or precursor.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all mentions of organometallic complexes and/or chelate ligands present in the article.
Output Format¶
[
{
"compound_id": "L3",
"compound_name": "DOTA",
"SMILES": "O=C(O)CN(CCN(CC(=O)O)CC(=O)O)CC(=O)O",
"SMILES_type": "ligand",
"target": 21.3
},
{
"compound_id": "A31",
"compound_name": "tebroxime",
"SMILES": "[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC",
"SMILES_type": "environment",
"target": 17.9
}
]
Complexes - Lu¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in the chemistry of organometallic complexes and their properties."
}
Extraction Protocol¶
Your task is to extract every mention of organometallic complexes and chelate ligands from scientific article, and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
compound_id |
string | ID of a complex within the article, as cited in the text | "L3" , "A31" |
compound_name |
string | Abbreviated or full name of the complex or ligand as cited in the text | "DOTA" , "tebroxime" |
SMILES |
string | Full SMILES representation of ligand environment or single ligand | See note below |
SMILES_type |
string | One of "ligand" or "environment" |
|
target_value |
number | The numeric value of logarithms of thermodynamic stability constants lgK or logK |
Note on SMILES format: - If a complete organometallic complex is shown, extract all ligand structures without mentioning the metal (e.g., "COc1cc(C=CC([O-])CC([O-])CC([O-])C=Cc2ccc(O)c(OC)c2)ccc1O. [C-]#[O+].[C-]#[O+].[C-]#[O+].[OH-]") - For a chelate ligand without a complete organometallic complex, extract only that ligand's structure (e.g., 'O=C(O)CN(CCN(CC(CC(=O)O)CC(=O)O)CCN(CC(=O)O)CC(=O)O')
Extraction Rules¶
-
Extract each mention of
target_value
(lgK or logK) as a separate object. -
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If a molecule is fully depicted in a figure, write it as a SMILES string. If a molecule is depicted as a scaffold and residues separately in different places of an article, connect them by compound ID or name into one molecule and write it a single SMILES string.
-
If multiple thermodynamic stability constants appear for the same complex or ligand extract each separately.
-
Extract only structures that comply with these rules: 5.1. The complexes must contain Lu as the metal or the ligands must belong to complexes of that metal. 5.2. The complete molecular structure shall be given without errors in it or identifiers. 5.3. Compounds must contain more than one carbon (exclude CO, Me). 5.4. Compounds must not contain polymeric structures, attached biomolecules or carboranes, undefined radicals, undeciphered designations (e.g., amino acids) beyond the simplest abbreviations (i.e., Me, Et, Pr, Bu, Ph, Ac), names of radicals instead of their structure, or incomplete indication of the ligand structure (e.g., L = P, N). 5.5. Compounds must not be reaction intermediate or precursor.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all mentions of organometallic complexes and/or chelate ligands present in the article.
Output Format¶
[
{
"compound_id": "L3",
"compound_name": "DOTA",
"SMILES": "O=C(O)CN(CCN(CC(=O)O)CC(=O)O)CC(=O)O",
"SMILES_type": "ligand",
"target": 21.3
},
{
"compound_id": "A31",
"compound_name": "tebroxime",
"SMILES": "[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC.[C-]#[N+]CC(C)(C)OC",
"SMILES_type": "environment",
"target": 17.9
}
]
Nanozymes¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in nanozymes."
}
Extraction Protocol¶
Your task is to extract every mention of experiments for ALL nanozymes from a scientific article and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
formula |
string | Chemical formula of the nanozyme | "Fe3O4", "CuO" |
activity |
string | Catalytic activity type | "peroxidase", "oxidase", "catalase", "laccase" |
syngony |
string | Crystal unit of the nanozyme | "cubic", "hexagonal", "tetragonal", "monoclinic", "orthorhombic", "trigonal", "amorphous", "triclinic" |
length |
number | Length of nanozyme particle in nanometers | |
width |
number | Width of nanozyme particle in nanometers | |
depth |
number | Depth of nanozyme particle in nanometers | |
surface |
string | Surface molecule | "naked", "poly(ethylene oxide)" |
km_value |
number | Michaelis constant value | |
km_unit |
string | Unit for Michaelis constant | "mM" |
vmax_value |
number | Molar maximum reaction rate value | |
vmax_unit |
string | Unit for maximum reaction rate | "µmol/min", "mol/min" |
reaction_type |
string | Reaction type with substrate and co-substrate | "TMB + H2O2", "ABTS + H2O2" |
c_min |
number | Minimum substrate concentration (mM) | |
c_max |
number | Maximum substrate concentration (mM) | |
c_const |
number | Constant co-substrate concentration | |
c_const_unit |
string | Unit for co-substrate concentration | |
ccat_value |
number | Catalyst concentration in assays | |
ccat_unit |
string | Unit for catalyst concentration | |
ph |
number | pH level of experiments | |
temperature |
number | Temperature in Celsius |
Extraction Rules¶
-
Extract each nanozyme mention as a separate object.
-
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all nanozymes present in the article.
Output Format¶
[
{
"formula": "Fe3O4",
"activity": "peroxidase",
"syngony": "cubic",
"length": 10,
"width": 10,
"depth": 2.5,
"surface": "naked",
"km_value": 0.2,
"km_unit": "mM",
"vmax_value": 2.5,
"vmax_unit": "µmol/min",
"reaction_type": "TMB + H2O2",
"c_min": 0.01,
"c_max": 1.0,
"c_const": 1.0,
"c_const_unit": "mM",
"ccat_value": 0.05,
"ccat_unit": "mg/mL",
"ph": 4.0,
"temperature": 25
},
{
"formula": "CeO2",
"activity": "oxidase",
"syngony": "cubic",
"length": 5,
"width": 5,
"depth": 200,
"surface": "poly(ethylene oxide)",
"km_value": 54.05,
"km_unit": "mM",
"vmax_value": 7.88,
"vmax_unit": "10-8 M s-1",
"reaction_type": "TMB",
"c_min": 0.02,
"c_max": 0.8,
"c_const": 800,
"c_const_unit": "μM",
"ccat_value": 0.02,
"ccat_unit": "mg/mL",
"ph": 5.5,
"temperature": 37
}
]
Nanomag¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in nanomaterials characterization, specifically in magnetic nanoparticles and their physical properties."
}
Extraction Protocol¶
Your task is to extract every mention of magnetic properties for ALL nanoparticles from a scientific article and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
Composition | |||
name |
string | Material name | "BFO", "cobalt iron oxide" |
np_core |
string | Composition of material core | "Gd2O3", "Fe1Fe2O4" |
np_shell |
string | Composition of material shell | "chitosan", "Au1" |
core_shell_formula |
string | Combined core-shell formula | "Cr2O3-Co" |
np_shell_2 |
string | First additional shell layer | "PEG-5000" |
np_shell_3 |
string | Second additional shell layer | "Curcumin" |
Size Measurements | |||
np_hydro_size |
number | Size from DLS (nm) | |
xrd_scherrer_size |
number | Crystal size from XRD (nm) | |
emic_size |
number | Size from electron microscopy (nm) | |
Crystal Structure | |||
crystal_structure_core_shell |
string | Crystallographic structures | "hexagonal, cubic" |
space_group_core |
string | Space group of core | "fd-3m", "p4/mmm" |
space_group_shell |
string | Space group of shell | "fd-3m", "p4/mmm" |
xrd_crystallinity |
string | Crystallinity type | "crystalline", "amorphous" |
Magnetic Properties | |||
squid_sat_mag |
number | Saturation magnetization (emu/g) | |
squid_rem_mag |
number | Remanent magnetization (emu/g) | |
exchange_bias_shift_Oe |
number | Exchange bias (Oe) | |
vertical_loop_shift_M_vsl_emu_g |
number | Vertical loop shift (emu/g) | |
hc_kOe |
number | Coercivity (Oe) | |
Measurement Conditions | |||
squid_h_max |
number | Maximum magnetic field (kOe) | |
zfc_h_meas |
number | ZFC measurement field (kOe) | |
instrument |
string | Experimental instrument | "Quantum Design 7 T SQUID" |
fc_field_T |
number | FC field (Tesla) | |
squid_temperature |
number | SQUID temperature (K) | |
coercivity |
number | Coercivity (kOe) | |
Additional Properties | |||
htherm_sar |
number | Specific absorption rate (W/g) | |
mri_r1 |
number | MRI relaxation rate r1 (mM⁻¹·s⁻¹) | |
mri_r2 |
number | MRI relaxation rate r2 (mM⁻¹·s⁻¹) | |
blocking_temperature_K |
number | Blocking temperature (K) | |
curie_temperature_K |
number | Curie temperature (K) |
Extraction Rules¶
-
Extract each nanoparticle mention as a separate object.
-
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
Unit conversions for magnetic properties:
- For coercivity/exchange bias: 1T = 1000 Oe, 1 mT = 10000 Oe, 1kOe = 1000 Oe
-
For magnetization measurements: 1 A·m²/kg = 1 emu/g, 1 μ₀M(T) = 0.01257 emu/g
-
Preserve signs for exchange bias and vertical loop shift (default to + if not specified).
-
Extract all magnetic nanoparticles present in the article.
Output Format¶
[
{
"name": "Bismuth Ferrite",
"np_core": "BiFeO3",
"np_shell": "chitosan",
"core_shell_formula": "BiFeO3-chitosan",
"np_shell_2": "PEG-5000",
"np_shell_3": "Curcumin",
"np_hydro_size": 120,
"xrd_scherrer_size": 45,
"emic_size": 50,
"crystal_structure_core_shell": "rhombohedral, amorphous",
"space_group_core": "R3c",
"space_group_shell": "P2_1",
"xrd_crystallinity": "partially crystalline",
"squid_sat_mag": 40.5,
"squid_rem_mag": 22.1,
"exchange_bias_shift_Oe": 180,
"vertical_loop_shift_M_vsl_emu_g": 5.6,
"hc_kOe": 3.2,
"squid_h_max": 5.0,
"zfc_h_meas": 1.5,
"instrument": "Quantum Design 7 T SQUID magnetometer",
"fc_field_T": 0.1,
"squid_temperature": 300,
"coercivity": 3.5,
"htherm_sar": 1.2,
"mri_r1": 4.5,
"mri_r2": 5.3,
"blocking_temperature_K": 350,
"curie_temperature_K": 800
}
]
Synergy¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in antimicrobial drug nanoparticle synergy."
}
Extraction Protocol¶
Your task is to extract every mention of nanoparticle properties, drug details, and their synergistic antibacterial effects from a scientific article, and output a JSON array of objects only.
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
Nanoparticle Properties | |||
NP |
string | Nanoparticle name | "Ag", "Au" |
NP_synthesis |
string | Synthesis method | "chemical synthesis" |
NP_concentration_µg_ml |
number | Nanoparticle concentration | |
NP_size_min_nm |
number | Minimum particle size (nm) | |
NP_size_max_nm |
number | Maximum particle size (nm) | |
NP_size_avg_nm |
number | Average particle size (nm) | |
shape |
string | Particle morphology | "spherical", "rod-shaped" |
zeta_potential_mV |
number | Surface charge (mV) | |
Bacterial Information | |||
bacteria |
string | Bacterial species | "Escherichia coli" |
strain |
string | Strain identifier | "ATCC 25922" |
MDR |
string | Multidrug resistance status | "Yes", "No" |
Drug Properties | |||
drug |
string | Antibiotic name | "Ampicillin" |
drug_dose_µg_disk |
number | Drug dosage | |
Experimental Methods | |||
method |
string | Assessment technique | "MIC", "disc_diffusion" |
time_hr |
number | Exposure duration (hours) | |
Activity Measurements | |||
ZOI_drug_mm_or_MIC _µg_m |
number | Drug activity | |
error_ZOI_drug_mm_or_MIC_µg_ml |
number | Drug activity error | |
ZOI_NP_mm_or_MIC_np_µg_ml |
number | NP activity | |
error_ZOI_NP_mm_or_MIC_np_µg_ml |
number | NP activity error | |
ZOI_drug_NP_mm_or_MIC_drug_NP_µg_ml |
number | Combined activity | |
error_ZOI_drug_NP_mm_or_MIC_drug_NP_µg_ml |
number | Combined activity error | |
Synergy Metrics | |||
fold_increase_in_antibacterial_activity |
number | Activity enhancement | |
FIC |
number | Fractional Inhibitory Concentration | |
effect |
string | Interaction type | "synergistic", "additive" |
Additional Parameters | |||
coating_with_antimicrobial_peptide_polymers |
string | Surface modification | |
combined_MIC |
number | Combined MIC (µg/ml) | |
peptide_MIC |
number | Peptide MIC (µg/ml) | |
viability_% |
number | Bacterial survival (%) | |
viability_error |
number | Viability measurement error |
Extraction Rules¶
-
Extract each nanoparticles mention as a separate object.
-
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
Extract all nanoparticles present in the article.
Output Format¶
[
{
"NP": "Ag",
"bacteria": "Pseudomonas aeruginosa",
"strain": "ATCC 27853",
"NP_synthesis": "Green synthesis using Gloeophyllum striatum",
"drug": "Ampicillin",
"drug_dose_µg_disk": 16.0,
"NP_concentration_µg_ml": 32.0,
"NP_size_min_nm": 10.0,
"NP_size_max_nm": 40.0,
"NP_size_avg_nm": 20.0,
"shape": "spherical",
"method": "MIC",
"ZOI_drug_mm_or_MIC _µg_ml": 16.0,
"error_ZOI_drug_mm_or_MIC_µg_ml": 1.40,
"ZOI_NP_mm_or_MIC_np_µg_ml": 32.0,
"error_ZOI_NP_mm_or_MIC_np_µg_ml": 2.43,
"ZOI_drug_NP_mm_or_MIC_drug_NP_µg_ml": 8.0,
"error_ZOI_drug_NP_mm_or_MIC_drug_NP_µg_ml": 1.50,
"fold_increase_in_antibacterial_activity": 2.0,
"zeta_potential_mV": -34.0,
"MDR": "R",
"FIC": 0.5,
"effect": "synergistic",
"time_hr": 24.0,
"coating_with_antimicrobial_peptide_polymers": "AP Lysozyme hen egg-white",
"combined_MIC": 12,
"peptide_MIC": 400,
"viability_%": 87.0,
"viability_error": 2.40
}
]
Seltox¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in antimicrobial nanoparticles."
}
Extraction Protocol¶
Your task is to extract information for ALL antimicrobial nanoparticles from a scientific article and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
np |
string | Nanoparticle name | "Ag", "Au", "ZnO" |
coating |
string | Surface coating/modification | "1" for coating, "0" for none |
bacteria |
string | Bacterial strain tested | "Escherichia coli", "Staphylococcus aureus" |
mdr |
number | Multidrug-resistant strain indicator | 1 (MDR), 0 (not MDR) |
strain |
string | Specific strain identifier | "ATCC 25922" |
np_synthesis |
string | Synthesis method | "green_synthesis", "chemical_synthesis" |
method |
string | Assay type | "MIC", "ZOI", "MBC", "MBEC" |
mic_np_µg_ml |
number | Minimum Inhibitory Concentration | μg/mL |
concentration |
number | Concentration for ZOI | μg/mL |
zoi_np_mm |
number | Zone of Inhibition | mm |
np_size_min_nm |
number | Minimum nanoparticle size | nm |
np_size_max_nm |
number | Maximum nanoparticle size | nm |
np_size_avg_nm |
number | Average nanoparticle size | nm |
shape |
string | Morphology | "spherical", "triangular" |
time_set_hours |
number | Experiment duration | hours |
zeta_potential_mV |
number | Surface charge | mV |
solvent_for_extract |
string | Solvent used in green synthesis | "water", "ethanol" |
temperature_for_extract_C |
number | Extract preparation temperature | °C |
duration_preparing_extract_min |
number | Extract preparation time | minutes |
precursor_of_np |
string | Chemical precursor | "AgNO3" |
concentration_of_precursor_mM |
number | Precursor concentration | mM |
hydrodynamic_diameter_nm |
number | Hydrodynamic size | nm |
ph_during_synthesis |
number | pH of synthesis solution |
Extraction Rules¶
-
Extract solvents and precursors as strings without parsing into molecular components.
-
Extract each nanoparticle mention as a separate object.
-
Do not filter, group, summarize, or deduplicate. Include repeated mentions and duplicates if they occur in different contexts.
-
If you cannot find a required field for an object, re-check the context; if it's still absent, set that field's value to
"NOT_DETECTED"
. -
The example output shows only two extracted samples, however your output should contain all nanoparticles present in the article.
Output Format¶
[
{
"np": "Ag",
"coating": "0",
"bacteria": "Enterococcus faecalis",
"mdr": 0,
"strain": "ATCC 29212",
"np_synthesis": "Green synthesis using Ixora brachypoda",
"method": "MIC",
"mic_np_µg_ml": 32.0,
"concentration": 10,
"zoi_np_mm": 15,
"np_size_min_nm": 10.0,
"np_size_max_nm": 40.0,
"np_size_avg_nm": 20.0,
"shape": "spherical",
"time_set_hours": 24,
"zeta_potential_mV": -27.9,
"solvent_for_extract": "water",
"temperature_for_extract_C": 21.0,
"duration_preparing_extract_min": 1440,
"precursor_of_np": "AgNO3",
"concentration_of_precursor_mM": 1.0,
"hydrodynamic_diameter_nm": 55,
"ph_during_synthesis": 8.5
},
{
"np": "ZnO",
"coating": "0",
"bacteria": "Klebsiella pneumoniae",
"mdr": 1,
"strain": "K-36",
"np_synthesis": "Green synthesis using Phyllanthus emblica",
"method": "MIC",
"mic_np_µg_ml": 6.25,
"concentration": 64,
"zoi_np_mm": 12,
"np_size_min_nm": 20.0,
"np_size_max_nm": 20.0,
"np_size_avg_nm": 20.0,
"shape": "spherical",
"time_set_hours": 24.0,
"zeta_potential_mV": -32,
"solvent_for_extract": "methanol",
"temperature_for_extract_C": 60,
"duration_preparing_extract_min": 60,
"precursor_of_np": "Zn(NO3).6.H2O",
"concentration_of_precursor_mM": 10,
"hydrodynamic_diameter_nm": 30,
"ph_during_synthesis": 7.0
}
]
Cytotox¶
System Message¶
{
"description": "You are a domain-specific chemical information extraction assistant.",
"instructions": "You specialize in cytotoxic nanoparticles."
}
Extraction Protocol¶
Your task is to extract information for ALL cytotoxic nanoparticles from a scientific article and output a JSON array of objects only (no markdown, no commentary, no extra text).
Required Fields¶
Field | Type | Description | Example |
---|---|---|---|
Material Properties | |||
material |
string | Nanoparticle composition | "SiO2", "Ag" |
shape |
string | Physical shape | "Sphere", "Rod" |
coat_functional_group |
string | Surface coating | "CTAB", "PEG" |
synthesis_method |
string | Synthesis method | "Precipitation", "Commercial" |
surface_charge |
string | Surface charge type | "Negative", "Neutral", "Positive" |
Size Measurements | |||
core_nm |
number | Primary particle size (nm) | |
size_in_medium_nm |
number | Size in biological medium (nm) | |
hydrodynamic_nm |
number | Size with coatings (nm) | |
Surface Properties | |||
potential_mv |
number | Surface charge (mV) | |
zeta_in_medium_mv |
number | Zeta potential in medium (mV) | |
Cell Parameters | |||
no_of_cells_cells_well |
number | Cell density per well | |
human_animal |
string | Cell origin | "A" (Animal), "H" (Human) |
cell_source |
string | Species/organism | "Rat", "Human" |
cell_tissue |
string | Tissue origin | "Adrenal Gland", "Lung" |
cell_morphology |
string | Cell shape | "Irregular", "Epithelial" |
cell_age |
string | Developmental stage | "Adult", "Embryonic" |
Experimental Conditions | |||
time_hr |
number | Exposure duration (hours) | |
concentration |
number | Material concentration | |
test |
string | Cytotoxicity assay | "MTT", "LDH" |
test_indicator |
string | Measured reagent | "TetrazoliumSalt" |
viability_% |
number | Cell viability percentage |
Extraction Rules¶
- For multiple values:
- Prioritize TEM measurements for core_nm
- Note concentration units from context
-
Extract viability as reported (>100% allowed)
-
Data priority:
- Table data over text
-
Note assumptions for ambiguous data
-
Extract each nanoparticle mention as a separate object.
-
Do not filter, group, summarize, or deduplicate.
-
Use
"NOT_DETECTED"
for missing fields after recheck. -
Include all nanoparticles from the article.
Output Format¶
[
{
"material": "SiO2",
"shape": "Rod",
"coat_functional_group": "PEG",
"synthesis_method": "Precipitation",
"surface_charge": "Negative",
"core_nm": 20.0,
"size_in_medium_nm": 25.0,
"hydrodynamic_nm": 30.0,
"potential_mv": -15.0,
"zeta_in_medium_mv": -10.0,
"no_of_cells_cells_well": 5000.0,
"human_animal": "H",
"cell_source": "Human",
"cell_tissue": "Lung",
"cell_morphology": "Epithelial",
"cell_age": "Adult",
"time_hr": 24.0,
"concentration": 100.0,
"test": "MTT",
"test_indicator": "TetrazoliumSalt",
"viability_%": 85.0
},
{
"material": "Fe3O4",
"shape": "Sphere",
"coat_functional_group": "Dextran",
"synthesis_method": "Thermal Decomposition",
"surface_charge": "Positive",
"core_nm": 10.0,
"size_in_medium_nm": 15.0,
"hydrodynamic_nm": 18.0,
"potential_mv": -30.0,
"zeta_in_medium_mv": -15.0,
"no_of_cells_cells_well": 10000.0,
"human_animal": "A",
"cell_source": "Dog",
"cell_tissue": "Kidney",
"cell_morphology": "Epithelial",
"cell_age": "Adult",
"time_hr": 24.0,
"concentration": 300.0,
"test": "MTT",
"test_indicator": "TetrazoliumSalt",
"viability_%": 115.09
}
]