Subtask of #291
Following the Toxicity perspective (#1798 → #1821), this is the next remaining DecodingTrust sub-task. DecodingTrust [1] ships a human-annotated subset of the Jiminy Cricket benchmark [2] at /data/machine_ethics/jiminy_subset.csv: 2091 actions extracted from text-adventure games, each labelled by humans with a structured Morality field.
Data
- jiminy_subset.csv: 2091 rows. Columns of interest: Description (plain-English summary of the action), Neighboring text (Zork ZIL source snippet), Morality (label), plus traceability fields File / Line.
- Morality follows the pattern {good|bad}, {self|others}, {1|2|3} (intensity 1-3); 50+ rows carry multiple labels joined by \n; 245 rows are unlabelled (neutral baseline).
- The folder also contains jiminy_train.json (1000) and jiminy_test.json (4000), but the embedded label-vector schema ([0, 1, 0, 0]) is not documented in this repo. I'm leaving those for a follow-up sub-issue to avoid mis-mapping.
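A minimal sketch of how the Morality field could be parsed, based only on the pattern described above. The names MoralityLabel and parse_morality are hypothetical, and the tolerance for extra whitespace and mixed case is an assumption rather than something verified against the CSV.

```python
import re
from typing import NamedTuple, Optional


class MoralityLabel(NamedTuple):
    """One parsed label (hypothetical helper type, not part of the source)."""
    valence: str    # "good" or "bad"
    target: str     # "self" or "others"
    intensity: int  # 1, 2 or 3


# Pattern from the issue text: {good|bad}, {self|others}, {1|2|3}
# Assumption: optional whitespace and case differences are tolerated.
_LABEL_RE = re.compile(
    r"^\s*(good|bad)\s*,\s*(self|others)\s*,\s*([123])\s*$",
    re.IGNORECASE,
)


def parse_morality(raw: Optional[str]) -> list[MoralityLabel]:
    """Parse a raw Morality cell into zero or more labels.

    An empty or missing cell yields an empty list (the unlabelled,
    neutral-baseline rows). Multiple labels are split on newlines.
    """
    if raw is None or not raw.strip():
        return []
    labels = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines between labels
        match = _LABEL_RE.match(line)
        if match is None:
            # Fail loudly instead of silently mis-mapping a label.
            raise ValueError(f"Unrecognised Morality label: {line!r}")
        valence, target, intensity = match.groups()
        labels.append(
            MoralityLabel(valence.lower(), target.lower(), int(intensity))
        )
    return labels
```

Raising on an unrecognised label is a deliberate choice in this sketch, so that any row that departs from the documented pattern surfaces during loading rather than being dropped.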
Proposed loader
_DecodingTrustMachineEthicsDataset(_RemoteDatasetLoader), mirroring the structure of the just-shipped Toxicity loader.
- Parameters:
  - morality: Literal["bad", "good", "neutral", "all"] = "bad". The default matches the red-teaming convention agreed for Toxicity on #1798.
  - min_intensity: int = 1. Keeps rows whose max label intensity is at or above the threshold (1-3).
- Per-row harm_categories derived from Morality: bad_to_self, bad_to_others, good_to_self, good_to_others (source's terminology).
- value = Description (plain English). Neighboring text and intensity preserved in SeedPrompt.metadata for reproducibility.
- Source URL pinned to commit 161ae8321ced62f45fcd9ceb412e05b47c603cd4 (same pin as the Toxicity loader).
- Unit tests mock _fetch_from_url, mirroring tests/unit/datasets/test_decoding_trust_toxicity_dataset.py.
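The row selection and category derivation proposed above could look roughly like the sketch below. It is not the loader itself: keep_row and harm_categories are hypothetical helper names, labels are assumed to be already parsed into (valence, target, intensity) tuples, and two details the proposal leaves open are filled with labelled assumptions (the intensity threshold uses the maximum over all of a row's labels, and unlabelled rows are never thresholded).

```python
from typing import Literal, Sequence

# (valence, target, intensity), e.g. ("bad", "others", 2)
Label = tuple[str, str, int]
MoralityFilter = Literal["bad", "good", "neutral", "all"]


def harm_categories(labels: Sequence[Label]) -> list[str]:
    """Map labels to categories such as 'bad_to_others', de-duplicated
    while preserving first-seen order."""
    categories: list[str] = []
    for valence, target, _intensity in labels:
        category = f"{valence}_to_{target}"
        if category not in categories:
            categories.append(category)
    return categories


def keep_row(
    labels: Sequence[Label],
    morality: MoralityFilter = "bad",
    min_intensity: int = 1,
) -> bool:
    """Decide whether a row passes the morality / min_intensity filters."""
    if not 1 <= min_intensity <= 3:
        raise ValueError("min_intensity must be 1, 2 or 3")
    if not labels:
        # Assumption: unlabelled rows have no intensity, so the threshold
        # is not applied; they pass only for "neutral" and "all".
        return morality in ("neutral", "all")
    if morality == "neutral":
        return False
    if morality != "all" and not any(v == morality for v, _, _ in labels):
        return False
    # Assumption: "max label intensity" means the max over all labels
    # on the row, not only those matching the requested valence.
    return max(intensity for _, _, intensity in labels) >= min_intensity
```

For example, a row labelled ("bad", "others", 2) passes the defaults and maps to the single category bad_to_others, while a row with no labels is kept only when morality is "neutral" or "all".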
License & attribution
Same approach as Toxicity (confirmed by @romanlutz on #1798): runtime fetch from raw.githubusercontent.com (no redistribution), plus full attribution to both the DecodingTrust and Jiminy Cricket authors in the class docstring and in the per-SeedPrompt authors / groups.
References
- [1] Wang et al., 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. https://arxiv.org/abs/2306.11698
- [2] Hendrycks et al., 2021. What Would Jiminy Cricket Do? Towards Agents That Behave Morally. https://arxiv.org/abs/2110.13136
⚠️ Content warning: the dataset describes harmful actions (self-harm, violence, theft, deception) extracted from text-adventure games — standard for safety/ethics evaluation but worth flagging.