Skip to content

Instantly share code, notes, and snippets.

@zkytony
Last active October 28, 2025 04:00
Show Gist options
  • Select an option

  • Save zkytony/3244953329299934aef5bc12765cca98 to your computer and use it in GitHub Desktop.

Select an option

Save zkytony/3244953329299934aef5bc12765cca98 to your computer and use it in GitHub Desktop.
Validation for POUCT in pomdp-py

Why

To answer the questions, (1) how was the correctness of pomdp_py's implementation of POUCT validated, and (2) does it behave correctly? Prompted by this issue.

How

In the Tiger domain, with initial belief [0.5, 0.5], compare the value at the root of the POUCT search tree built after planning for the first action with the optimal value produced by pomdp-solve's vi pruning algorithm (an optimal solver) on the Tiger domain. The value in POUCT search tree should be an estimate of the optimal value and should be close.

Try this for horizon = 1, 2, 3, 4. (Note, max_depth := horizon - 1)

This is the POUCT configuration

pouct = pomdp_py.POUCT(
    max_depth=#<horizon>-1,
    discount_factor=0.95,
    num_sims=100000,
    exploration_const=50,
    rollout_policy=tiger.agent.policy_model,
    show_progress=True,
)

How to interpret .alphas file: Each row with 2 numbers is an alpha vector; Take dot product of a given belief vector and each alpha vector; The largest value is the optimal value at the given belief.

What

horizon = 1

POUCT search tree

_VNodePP(n=99999, v=-1.000)(depth=0)
├─── ₀listen⟶_QNodePP(n=99942, v=-1.000)
├─── ₁open-left⟶_QNodePP(n=27, v=-34.815)
└─── ₂open-right⟶_QNodePP(n=30, v=-34.000)

vi_pruning output .alphas file:

2
-100.0000000000000000000000000 10.0000000000000000000000000 

1
-1.0000000000000000000000000 -1.0000000000000000000000000 

0
10.0000000000000000000000000 -100.0000000000000000000000000 

Both are -1.0.

horizon = 2

POUCT search tree

_VNodePP(n=99999, v=-2.034)(depth=0)
├─── ₀listen⟶_QNodePP(n=99985, v=-2.034)
│    ├─── ₀tiger-left⟶_VNodePP(n=49805, v=-1.000)(depth=1)
│    │    ├─── ₀listen⟶_QNodePP(n=48254, v=-1.000)
│    │    ├─── ₁open-left⟶_QNodePP(n=5, v=-78.000)
│    │    └─── ₂open-right⟶_QNodePP(n=1546, v=-4.444)
│    └─── ₁tiger-right⟶_VNodePP(n=50178, v=-1.000)(depth=1)
│         ├─── ₀listen⟶_QNodePP(n=49862, v=-1.000)
│         ├─── ₁open-left⟶_QNodePP(n=311, v=-9.807)
│         └─── ₂open-right⟶_QNodePP(n=5, v=-78.000)
├─── ₁open-left⟶_QNodePP(n=10, v=-60.390)
│    ├─── ₀tiger-left⟶_VNodePP(n=2, v=10.000)(depth=1)
│    └─── ₁tiger-right⟶_VNodePP(n=6, v=10.000)(depth=1)
│         ├─── ₀listen⟶_QNodePP(n=2, v=-1.000)
│         ├─── ₁open-left⟶_QNodePP(n=2, v=10.000)
│         └─── ₂open-right⟶_QNodePP(n=2, v=-45.000)
└─── ₂open-right⟶_QNodePP(n=4, v=-92.500)

vi_pruning output .alphas file:

2
-100.9500000000000028421709430 9.0500000000000007105427358 

1
-16.0574999770099999807371205 6.9324999770100008689155402 

1
-1.9500000000000001776356839 -1.9500000000000001776356839 

1
6.9324999770100008689155402 -16.0574999770099999807371205 

0
9.0500000000000007105427358 -100.9500000000000028421709430 

-2.034 and -1.95 are close enough.

horizon = 3

POUCT search tree (partial)

_VNodePP(n=99999, v=2.164)(depth=0)
├─── ₀listen⟶_QNodePP(n=99989, v=2.164)
│    ├─── ₀tiger-left⟶_VNodePP(n=49801, v=3.326)(depth=1)
│    │    ├─── ₀listen⟶_QNodePP(n=49781, v=3.326)
│    │    │    ├─── ₀tiger-left⟶_VNodePP(n=37173, v=6.550)(depth=2)
│    │    │    │    ├─── ₀listen⟶_QNodePP(n=374, v=-1.000)
│    │    │    │    ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│    │    │    │    └─── ₂open-right⟶_QNodePP(n=36796, v=6.550)
│    │    │    └─── ₁tiger-right⟶_VNodePP(n=12606, v=-1.000)(depth=2)
│    │    │         ├─── ₀listen⟶_QNodePP(n=12581, v=-1.000)
│    │    │         ├─── ₁open-left⟶_QNodePP(n=16, v=-38.125)
│    │    │         └─── ₂open-right⟶_QNodePP(n=9, v=-51.111)
│    │    ├─── ₁open-left⟶_QNodePP(n=3, v=-95.000)
│    │    └─── ₂open-right⟶_QNodePP(n=17, v=-38.235)
│    │         ├─── ₀tiger-left⟶_VNodePP(n=6, v=-1.000)(depth=2)
│    │         │    ├─── ₀listen⟶_QNodePP(n=3, v=-1.000)
│    │         │    ├─── ₁open-left⟶_QNodePP(n=2, v=-45.000)
│    │         └─── ₁tiger-right⟶_VNodePP(n=9, v=-1.000)(depth=2)
│    │              ├─── ₀listen⟶_QNodePP(n=7, v=-1.000)

vi_pruning output .alphas file:

2
-101.8525000000000062527760747 8.1475000000000008526512829 

1
-28.3518061957880114221097756 7.2957562110763634066756822 

1
-16.9599999770099998386285733 6.0299999770100010110240873 

1
-4.8628187375409872572618042 4.3201187222526371556341473 

1
2.3097999847116499338994799 2.3097999847116499338994799 

1
4.3201187222526371556341473 -4.8628187375409872572618042 

1
6.0299999770100010110240873 -16.9599999770099998386285733 

1
7.2957562110763634066756822 -28.3518061957880114221097756 

0
8.1475000000000008526512829 -101.8525000000000062527760747 

2.164 and 2.309 are close enough.

horizon = 4

POUCT search tree (partial)

_VNodePP(n=99999, v=1.469)(depth=0)
├─── ₀listen⟶_QNodePP(n=99992, v=1.469)
│    ├─── ₀tiger-left⟶_VNodePP(n=50056, v=2.662)(depth=1)
│    │    ├─── ₀listen⟶_QNodePP(n=50045, v=2.662)
│    │    │    ├─── ₀tiger-left⟶_VNodePP(n=37446, v=6.125)(depth=2)
│    │    │    │    ├─── ₀listen⟶_QNodePP(n=27829, v=6.125)
│    │    │    │    │    ├─── ₀tiger-left⟶_VNodePP(n=23068, v=9.442)(depth=3)
│    │    │    │    │    │    ├─── ₀listen⟶_QNodePP(n=191, v=-1.000)
│    │    │    │    │    │    ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│    │    │    │    │    │    └─── ₂open-right⟶_QNodePP(n=22874, v=9.442)
│    │    │    │    │    └─── ₁tiger-right⟶_VNodePP(n=4759, v=-1.000)(depth=3)
│    │    │    │    │         ├─── ₀listen⟶_QNodePP(n=4551, v=-1.000)
│    │    │    │    │         ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│    │    │    │    │         └─── ₂open-right⟶_QNodePP(n=205, v=-9.317)

ran with num_sims=200k instead of 100k:

_VNodePP(n=199999, v=1.572)(depth=0)
├─── ₀listen⟶_QNodePP(n=199992, v=1.572)
│    ├─── ₀tiger-left⟶_VNodePP(n=99693, v=2.711)(depth=1)
│    │    ├─── ₀listen⟶_QNodePP(n=99659, v=2.711)
│    │    │    ├─── ₀tiger-left⟶_VNodePP(n=74190, v=6.145)(depth=2)
│    │    │    │    ├─── ₀listen⟶_QNodePP(n=56780, v=6.145)
│    │    │    │    │    ├─── ₀tiger-left⟶_VNodePP(n=47076, v=9.371)(depth=3)
│    │    │    │    │    │    ├─── ₀listen⟶_QNodePP(n=218, v=-1.000)
│    │    │    │    │    │    ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│    │    │    │    │    │    └─── ₂open-right⟶_QNodePP(n=46855, v=9.371)
│    │    │    │    │    └─── ₁tiger-right⟶_VNodePP(n=9702, v=-1.000)(depth=3)
│    │    │    │    │         ├─── ₀listen⟶_QNodePP(n=9588, v=-1.000)

vi_pruning output .alphas file:

2
-97.8056900145239325183865731 12.1943099854760674816134269 

1
-3.1749688868496956928311192 5.2204696298701351864224307 

1
0.0716958745996567614611195 3.1762733082314071886287365 

1
1.7955441981194137923694143 1.7955441981194137923694143 

1
3.1762733082314071886287365 0.0716958745996567614611195 

1
5.2204696298701351864224307 -3.1749688868496956928311192 

0
12.1943099854760674816134269 -97.8056900145239325183865731 

Though 1.469 seems slightly off from 1.795, increasing num_sims brought the estimate closer at 1.572.

Conclusion

The estimates are generally close, indicating POUCT in pomdp-py is doing a reasonable job approximating the optimal value in this domain. Though, as horizon increases, the estimate becomes poorer under the same number of simuations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment