Tau3 Bench - banking_knowledge

Banking Data And Task Explorer

Open benchmark internals, without digging through JSON.

This page loads the public Tau3 banking database, task labels, document labels, and the open Tau2 telecom task-generation code path. It is meant for reading the benchmark the way a dataset engineer would, not just running it.

Structured database

Banking DB Tables


Labeled tasks

Task Explorer

Open Tau2 code

How Telecom Tasks Are Generated

The telecom generator is the clean public example of synthetic task creation in Tau2. It does not ask an LLM to invent tasks. It composes small simulator mutations, creates an oracle fix path, then verifies that path in a fresh telecom environment.

  • 3 task families: mobile data, service, MMS
  • 2,285 generated telecom tasks: tasks_full.json in this checkout
  • 20 single-issue tasks: tasks_small.json
  • 32 expected-transfer tasks: unfixable cases add ACTION reward

What is open

The banking knowledge construction pipeline is described in the paper, but the public repo ships the finished banking tasks and documents. Telecom is different: its task composition code is open under src/tau2/domains/telecom/tasks/.

The mental model

A telecom task is built from root-cause atoms. Each atom says: here is how to break the simulator, here is the expected way to fix it, and here are any extra checks needed to prove the fix. Multi-issue tasks are just multiple atoms applied to the same clean env.

The data objects

  1. BaseTask: one root cause, with init funcs and fix funcs.
  2. SelectionSet: a group where the composer chooses one option or none.
  3. ComposedTask: flattened root causes, init funcs, fixes, and assertions.
  4. Task: the final benchmark JSON the harness can run.
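Under stated assumptions, the four objects can be sketched as Python dataclasses. The field names here are inferred from the surrounding text, not copied from the real classes in src/tau2/domains/telecom/tasks/:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch; field names are inferred from the text above,
# not from the real Tau2 classes.

@dataclass
class BaseTask:
    name: str
    init_funcs: list[Callable]            # break the simulator
    fix_funcs: list[Optional[Callable]]   # oracle fixes; a None marks "unfixable"

@dataclass
class SelectionSet:
    tasks: list[BaseTask]                 # composer picks one member, or None

@dataclass
class ComposedTask:
    root_causes: list[str]                # flattened from the chosen atoms
    init_funcs: list[Callable]
    fix_funcs: list[Optional[Callable]]
    assertions: list[Callable]

# Task itself is the final benchmark JSON dict the harness runs.
```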

Important files

  • create_tasks.py merges all generated task families and samples the base split.
  • manager.py turns composed root causes into final task JSON.
  • mobile_data_issues.py, service_issues.py, and mms_issues.py define issue families.
  • utils.py contains composition helpers like compose_tasks.

One subtle repo detail

create_tasks.py has code to write a sampled tasks.json split from multi-issue bins. In the local checkout, however, tasks.json and tasks_full.json are identical, each with 2,285 tasks. So read the sampling code path and the checked-in artifacts as separate sources of truth: the code describes what could be written, the files show what actually was.

Exact pipeline

  1. Issue files define atoms. service_issues.py, mobile_data_issues.py, and mms_issues.py define BaseTask objects. Each one has init_funcs that create the failure and fix_funcs that return the expected tool calls.
  2. Selection sets create the combinatorics. compose_tasks takes the cartesian product of every selection set plus None. That means each issue slot can contribute one root cause, or be absent. Validators keep the family honest: mobile-data tasks must include at least one mobile-data issue; MMS tasks must include at least one MMS issue.
  3. The manager starts from a clean simulator. TaskManager.create_task calls get_environment(), applies set_surrounding to set the user name, phone, and location, then runs every init func. Init assertions are checked during generation, but only real env mutations are saved into initial_state.initialization_actions.
  4. The oracle fix path is generated, not guessed. Fix funcs inspect the broken env and return expected ToolCalls. A tool with requestor: "assistant" is an agent-side backend tool. A tool with requestor: "user" is something the simulated user can do after the agent asks them, like toggling roaming or rebooting the phone.
  5. Unfixable cases are represented explicitly. If any fix func is None, the task becomes an expected failure. The answer key becomes transfer_to_human_agents, and the reward basis becomes ENV_ASSERTION + ACTION.
  6. The final task is assembled. The manager fills the scenario text, ticket text, persona, task id, initialization actions, expected actions, env assertions, and reward basis. Persona is assigned by cycling through None, Easy, and Hard.
  7. The generated task is proved. verify_task creates a fresh telecom env, checks it starts fixed, applies the saved initialization actions, replays the expected fix actions, and finally checks is_fixed plus all env assertions. This is the guardrail that keeps synthetic tasks from becoming nonsense JSON.
BaseTask(
  name="data_mode_off",
  init_funcs=[data_mode_off],          # break the simulator
  fix_funcs=[fix_data_mode_off],       # oracle expected action
)

compose_tasks(selection_sets)
  product(selection_set.tasks + [None] for each selection set)
  validate the combination
  sort selected atoms by name
  flatten init funcs, fix funcs, and extra assertions
  return ComposedTask(...)
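The compose_tasks pseudocode above can be sketched as runnable Python. Here atoms are toy strings and the validator lambda is an assumed stand-in for the real per-family validators, not the actual Tau2 code:

```python
from itertools import product

def compose(selection_sets, validate_fn):
    """Each selection set contributes one atom, or None (absent)."""
    composed = []
    for choice in product(*[s + [None] for s in selection_sets]):
        atoms = [a for a in choice if a is not None]
        if atoms and validate_fn(atoms):
            composed.append(sorted(atoms))   # sort selected atoms by name
    return composed

# Toy example: mobile-data tasks must include at least one mobile-data issue.
sets = [["data_mode_off", "data_roaming_off"], ["mms_app_disabled"]]
valid = compose(sets, lambda atoms: any(a.startswith("data_") for a in atoms))
print(len(valid))  # 4 of the 5 non-empty combinations pass the validator
```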

TaskManager.create_task(composed_task)
  env = get_environment()
  run set_surrounding(env)
  run init funcs and save non-assertion EnvFunctionCalls
  run fix funcs to create expected ToolCalls
  if any fix func is None, expect transfer_to_human_agents
  fill Task JSON
  verify by replaying init + expected actions in a fresh env
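The final verification step can be sketched with a toy environment. FakeEnv and its apply/is_fixed API are assumptions for illustration only; the real guardrail is verify_task against a fresh telecom env:

```python
class FakeEnv:
    """Toy stand-in for the telecom simulator (not the real env API)."""
    def __init__(self):
        self.mobile_data = True            # a fresh env starts fixed

    def is_fixed(self):
        return self.mobile_data

    def apply(self, call):
        if call == "data_mode_off":        # initialization action re-breaks it
            self.mobile_data = False
        elif call == "toggle_data":        # expected user tool re-fixes it
            self.mobile_data = True

def verify(task, get_environment):
    env = get_environment()
    assert env.is_fixed(), "env must start clean"
    for action in task["initialization_actions"]:
        env.apply(action)                  # replay the saved breakage
    for call in task["expected_actions"]:
        env.apply(call)                    # replay the oracle fix path
    assert env.is_fixed(), "expected actions must fix the env"
    return all(check(env) for check in task["env_assertions"])

task = {
    "initialization_actions": ["data_mode_off"],
    "expected_actions": ["toggle_data"],
    "env_assertions": [lambda e: e.mobile_data],
}
print(verify(task, FakeEnv))  # True
```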

Concrete example: fixable

data_mode_off_task calls a user-env function that turns mobile data off. Its expected fix is the user tool toggle_data. The final assertions require mobile data to be enabled and the speed test to report excellent internet.
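As a hedged sketch of that atom: only data_mode_off, toggle_data, and the requestor field come from the text; the env attribute names are assumptions:

```python
from types import SimpleNamespace

def data_mode_off(env):
    """Init func: break the simulator by turning mobile data off.
    (Attribute path is hypothetical.)"""
    env.user.mobile_data = False

def fix_data_mode_off(env):
    """Fix func: inspect the broken env, return the expected ToolCall.
    requestor="user" means the simulated user performs it after the
    agent asks, like toggling roaming or rebooting the phone."""
    return {"name": "toggle_data", "requestor": "user"}

env = SimpleNamespace(user=SimpleNamespace(mobile_data=True))
data_mode_off(env)
print(env.user.mobile_data)                    # False
print(fix_data_mode_off(env)["name"])          # toggle_data
```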

Concrete example: unfixable

lock_sim_card_pin_task has fix_funcs=[None]. Tau2 still saves the broken starting state, but the answer key is for the assistant to transfer to human support instead of pretending it can unlock the SIM.
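The unfixable rule can be sketched as a hypothetical helper (not the real manager code), assuming the fixable default reward basis is ENV_ASSERTION alone, as the text implies:

```python
def answer_key(fix_funcs, env=None):
    """A single None fix func makes the whole task an expected failure."""
    if any(f is None for f in fix_funcs):
        # answer key becomes transfer; reward basis gains an ACTION component
        return ["transfer_to_human_agents"], ["ENV_ASSERTION", "ACTION"]
    return [f(env) for f in fix_funcs], ["ENV_ASSERTION"]

actions, basis = answer_key([None])
print(actions, basis)  # ['transfer_to_human_agents'] ['ENV_ASSERTION', 'ACTION']
```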

Why this matters for our RL generator

This is the recipe we would reuse for a banking-style generator: define small failure atoms, compose them, let code produce the oracle action path, and verify every task by replaying it against the environment. Banking is the richer trace target; telecom is the public blueprint for building reliable synthetic tasks.
