FEAT add DecodingTrust Machine Ethics (Jiminy CSV) dataset loader (subtask of #291) #1828

@v0ropaev

Subtask of #291

Following the Toxicity perspective (#1798, #1821), this is the next remaining DecodingTrust sub-task. DecodingTrust [1] ships a human-annotated subset of the Jiminy Cricket benchmark [2] at /data/machine_ethics/jiminy_subset.csv: 2091 actions extracted from text-adventure games, each labelled by humans with a structured Morality field.

Data

  • jiminy_subset.csv — 2091 rows. Columns of interest: Description (clear English summary of the action), Neighboring text (Zork ZIL source snippet), Morality (label), plus traceability fields File / Line.
  • Morality follows the pattern {good|bad}, {self|others}, {1|2|3} (intensity 1-3); 50+ rows carry multiple labels joined by \n; 245 rows are unlabelled (neutral baseline).
  • The folder also contains jiminy_train.json (1000) and jiminy_test.json (4000), but the embedded label-vector schema ([0, 1, 0, 0]) is not documented in this repo. I'm leaving those for a follow-up sub-issue to avoid mis-mapping.
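
To make the Morality format above concrete, here is a minimal parsing sketch. It assumes each label is a comma-separated triple on its own line, exactly as described in the bullets; `MoralityLabel` and `parse_morality` are hypothetical names for illustration, not existing loader code, and the real CSV may contain variants this does not handle.

```python
from typing import NamedTuple


class MoralityLabel(NamedTuple):
    """One parsed label (hypothetical helper type, not part of the loader)."""

    valence: str  # "good" or "bad"
    target: str  # "self" or "others"
    intensity: int  # 1, 2 or 3


def parse_morality(raw: str) -> list[MoralityLabel]:
    """Parse a Morality cell into zero or more labels.

    An empty cell (unlabelled row) yields an empty list; cells with several
    labels joined by newlines yield one entry per line.
    """
    labels = []
    for line in (raw or "").splitlines():
        line = line.strip()
        if not line:
            continue
        parts = [part.strip().lower() for part in line.split(",")]
        if len(parts) != 3:
            raise ValueError(f"Unexpected Morality label: {line!r}")
        valence, target, intensity = parts
        if (
            valence not in {"good", "bad"}
            or target not in {"self", "others"}
            or intensity not in {"1", "2", "3"}
        ):
            raise ValueError(f"Unexpected Morality label: {line!r}")
        labels.append(MoralityLabel(valence, target, int(intensity)))
    return labels


single = parse_morality("bad, others, 2")
multiple = parse_morality("bad, others, 1\ngood, self, 3")
unlabelled = parse_morality("")
```

Raising on an unrecognised label, rather than silently skipping it, is one possible design choice here: it would surface schema surprises early instead of mis-mapping rows.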

Proposed loader

  • _DecodingTrustMachineEthicsDataset(_RemoteDatasetLoader) mirroring the just-shipped Toxicity loader's structure.
  • Parameters:
  • Per-row harm_categories derived from Morality: bad_to_self, bad_to_others, good_to_self, good_to_others (source's terminology).
  • value = Description (plain English). Neighboring text and intensity preserved in SeedPrompt.metadata for reproducibility.
  • Source URL pinned to commit 161ae8321ced62f45fcd9ceb412e05b47c603cd4 (same pin as the Toxicity loader).
  • Unit tests mock _fetch_from_url, mirroring tests/unit/datasets/test_decoding_trust_toxicity_dataset.py.
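
The row-to-prompt mapping proposed above could look roughly like the sketch below. It deliberately uses plain dictionaries instead of the real SeedPrompt class; `row_to_record`, the metadata key names, the column order, and the sample rows are all illustrative assumptions (the sample rows are made up, not taken from the dataset).

```python
import csv
import io


def row_to_record(row: dict[str, str]) -> dict:
    """Map one CSV row to a plain dict standing in for a seed prompt.

    Hypothetical helper: the actual loader would build SeedPrompt objects.
    """
    categories = []
    intensities = []
    for line in (row.get("Morality") or "").splitlines():
        parts = [part.strip().lower() for part in line.split(",")]
        if len(parts) != 3:
            continue  # blank line or unexpected shape: skipped in this sketch
        valence, target, intensity = parts
        category = f"{valence}_to_{target}"  # e.g. "bad_to_others"
        if category not in categories:
            categories.append(category)
        intensities.append(int(intensity))
    return {
        "value": row["Description"].strip(),
        "harm_categories": categories,  # empty for unlabelled (neutral) rows
        "metadata": {
            "neighboring_text": row.get("Neighboring text", ""),
            "intensities": intensities,  # assumption: keep one entry per label
            "file": row.get("File", ""),
            "line": row.get("Line", ""),
        },
    }


# Made-up rows for illustration only; the column order is an assumption.
SAMPLE_CSV = (
    "File,Line,Description,Neighboring text,Morality\n"
    'example.zil,10,Hit the troll,"<ROUTINE EXAMPLE>","bad, others, 2"\n'
    'example.zil,20,Eat the lunch,"<ROUTINE OTHER>",\n'
)

records = [row_to_record(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

Unit tests could feed a small in-memory CSV like `SAMPLE_CSV` through the mocked fetch so that no network access is needed.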

License & attribution

Same approach as Toxicity (confirmed by @romanlutz on #1798): fetch at runtime from raw.githubusercontent.com (no redistribution), with full attribution to both the DecodingTrust and Jiminy Cricket authors in the class docstring and in the per-SeedPrompt authors / groups fields.

References

  1. Wang et al., 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. https://arxiv.org/abs/2306.11698
  2. Hendrycks et al., 2021. What Would Jiminy Cricket Do? Towards Agents That Behave Morally. https://arxiv.org/abs/2110.13136

⚠️ Content warning: the dataset describes harmful actions (self-harm, violence, theft, deception) extracted from text-adventure games — standard for safety/ethics evaluation but worth flagging.
