| Name | ID | Type | Description | Report |
|---|---|---|---|---|
| 400-Word Essay on the Wars of the Roses | wars_of_roses_essay | knowledge | Tests the model's ability to write a concise, informative historical essay with proper structure on a well-known historical topic. | View Report |
| Free-Form Definition of 'Granite' | granite_definition | linguistic | Tests the model's ability to provide free-form word definitions, including translations and examples. | View Report |
| JSON-Schema Comprehensive Word Definition | comprehensive_definition | linguistic | Tests the model's ability to provide structured JSON word definitions, including example sentences and translations. | View Report |
| Sonnet about Daffodils in Spring | daffodil_sonnet | creative | Tests the model's ability to generate structured poetry following formal constraints while conveying specific imagery and themes. | View Report |
| Generate and Score Poker Hands | poker_hand_scorer | coding | Tests the model's ability to implement a basic algorithm involving playing cards, as well as its knowledge of poker. | View Report |
| Firefighter Break Room Dialogue | firefighter_conversation | creative | Tests the model's ability to create authentic dialogue. | View Report |
| Neighborhood Logic Puzzle | neighborhood_puzzle | reasoning | Tests the model's ability to solve a complex logic puzzle by tracking multiple constraints and making deductions. | View Report |