Benzimidazole Antibiotics – Chemical Structures and Antimicrobial Activity¶
Original Data¶
Title: Benzimidazole antibiotics – chemical structures and antimicrobial activity
Description: This dataset contains information on the chemical structures of benzimidazole-based antibiotics in SMILES format, along with their antimicrobial activity measured as Minimum Inhibitory Concentration (MIC) values against different Staphylococcus aureus and Escherecia coli strains. In addition to structures and target values, the dataset includes metadata such as source articles, extraction pages, origin type (table, text, image).
Total number of records: 1721
Number of extracted features (columns): 3 (main: SMILES and MIC), though the full table contains over 25 metadata columns
Input data type: Mixed
Domain: Small molecules / antibacterial compounds
Automatic validation: Yes
Extraction Prompt¶
This dataset was created using a specialized prompt for extracting information about benzimidazole antibiotics and their antimicrobial activity. The extraction system was configured to:
System Message: "Domain-specific chemical information extraction assistant specializing in chemistry of small molecules, particularly antibiotics and their properties."
Extraction Parameters:
- Target compounds: Benzimidazole antibiotics
- Target properties: MIC or pMIC measurements
- Test organisms: Staphylococcus aureus and Escherichia coli
- Required fields:
- compound_id
: Molecule identifier within the article
- smiles
: SMILES representation of benzimidazole antibiotic
- target_type
: MIC or pMIC
- target_relation
: =, <, or >
- target_value
: Numeric value
- target_units
: Measurement units
- bacteria
: Test organism name
Extraction Rules: 1. Each MIC/pMIC mention extracted separately 2. No filtering or deduplication 3. Ranges preserved as given 4. Complete SMILES strings constructed from scaffolds and residues 5. Only S. aureus and E. coli measurements included 6. Missing fields marked as "NOT_DETECTED"
Data Scheme¶
Column Descriptions¶
Column Name | Description |
---|---|
smiles |
Extracted chemical structure in isomeric SMILES format |
doi |
DOI of the article from which the data was extracted |
title |
Title of the article |
publisher |
Publisher |
year |
Year of publication |
access |
Access type (1 = open access, 0 = closed access) |
compound_id |
Compound identifier in the article |
target_type |
Type of target measurement (e.g., MIC or pMIC) |
target_relation |
Relation symbol (e.g., =, >, <) |
target_value |
Target value |
target_units |
Units of the target value (e.g., µg/mL) |
bacteria |
Bacterial species against which the compound was tested |
bacteria_unified |
Unified name of the bacterial species (e.g., Escherichia coli) |
page_bacteria |
Page number where the bacterium is mentioned in the article |
origin_bacteria |
Source of the bacterium mention (table, figure, text, image) |
section_bacteria |
Section of the article (if origin_bacteria = text ) |
subsection_bacteria |
Subsection (if origin_bacteria = text ) |
page_target |
Page number where the target value is provided |
origin_target |
Source of the target value (table, figure, text, image) |
section_target |
Section of the article (if origin_target = text ) |
subsection_target |
Subsection (if origin_target = text ) |
page_scaffold |
Page with the chemical scaffold or full molecule diagram |
origin_scaffold |
Origin of the scaffold image (table, figure, or unnumbered image) |
page_residue |
Page with substituent structures (if molecule is scaffold + residues) |
origin_residue |
Origin of the residue image (table, figure, or unnumbered image) |
Metadata¶
Field Name | Description |
---|---|
doi |
DOI of the original article |
title |
Title of the publication |
publisher |
Publisher |
year |
Year of publication |
access |
Access type (0 = closed, 1 = open access) |
Key Notes
- SMILES molecules may be presented:
- As complete structures (use
scaffold
only) - As scaffold + substituents (use both
scaffold
andresidue
columns) - Target values (MIC/pMIC) are extracted with full context: value, units, source, and page
- Bacterial names are provided in both raw and unified formats
- Data origin types include:
table with number
(e.g., table 1)figure with number
(e.g., scheme 1)image
– unnumbered image inside the texttext
– if extracted directly from the article's text- Section/subsection fields are filled only when data is extracted from text or image
Dataset Description¶
The goal of this dataset is to collect and structure chemical and biological data on benzimidazole-based antibiotics and their activity against bacterial pathogens. The primary target property is MIC (Minimum Inhibitory Concentration) expressed in µg/mL.
Each record includes:
- A chemical structure in SMILES format
- Information about the bacterial strain tested
- The target value (type, relation, numeric value, unit)
- Metadata about the source article (DOI, journal, year, publisher, access)
- Detailed information on the source in the article (page, table, figure, text, or image)
The dataset is intended for tasks such as scientific information extraction, QSAR modeling, structure–activity analysis, and antibacterial research involving small molecules.
Validation Results¶
The Benzimidazoles dataset contained
77 corrections (63 pattern-based, 14 isolated),
with the primary error loci being smiles
, target_value
, and compound_id
.
Common mistakes included inconsistent or incomplete SMILES
structures and naming mismatches,
often reflecting variable conventions across original publications.