To answer the questions, (1) how was the correctness of pomdp_py's implementation of POUCT validated, and (2) does it behave correctly? Prompted by this issue.
In the Tiger domain, with initial belief [0.5, 0.5], compare the value at the root of the POUCT search tree built after planning for the first action with the optimal value produced by pomdp-solve's vi pruning algorithm (an optimal solver) on the Tiger domain. The value in POUCT search tree should be an estimate of the optimal value and should be close.
Try this for horizon = 1, 2, 3, 4. (Note, max_depth := horizon - 1)
This is the POUCT configuration
pouct = pomdp_py.POUCT(
max_depth=#<horizon>-1,
discount_factor=0.95,
num_sims=100000,
exploration_const=50,
rollout_policy=tiger.agent.policy_model,
show_progress=True,
)How to interpret .alphas file: Each row with 2 numbers is an alpha vector; Take dot product of a given belief vector and each alpha vector; The largest value is the optimal value at the given belief.
POUCT search tree
_VNodePP(n=99999, v=-1.000)(depth=0)
├─── ₀listen⟶_QNodePP(n=99942, v=-1.000)
├─── ₁open-left⟶_QNodePP(n=27, v=-34.815)
└─── ₂open-right⟶_QNodePP(n=30, v=-34.000)
vi_pruning output .alphas file:
2
-100.0000000000000000000000000 10.0000000000000000000000000
1
-1.0000000000000000000000000 -1.0000000000000000000000000
0
10.0000000000000000000000000 -100.0000000000000000000000000
Both are -1.0.
POUCT search tree
_VNodePP(n=99999, v=-2.034)(depth=0)
├─── ₀listen⟶_QNodePP(n=99985, v=-2.034)
│ ├─── ₀tiger-left⟶_VNodePP(n=49805, v=-1.000)(depth=1)
│ │ ├─── ₀listen⟶_QNodePP(n=48254, v=-1.000)
│ │ ├─── ₁open-left⟶_QNodePP(n=5, v=-78.000)
│ │ └─── ₂open-right⟶_QNodePP(n=1546, v=-4.444)
│ └─── ₁tiger-right⟶_VNodePP(n=50178, v=-1.000)(depth=1)
│ ├─── ₀listen⟶_QNodePP(n=49862, v=-1.000)
│ ├─── ₁open-left⟶_QNodePP(n=311, v=-9.807)
│ └─── ₂open-right⟶_QNodePP(n=5, v=-78.000)
├─── ₁open-left⟶_QNodePP(n=10, v=-60.390)
│ ├─── ₀tiger-left⟶_VNodePP(n=2, v=10.000)(depth=1)
│ └─── ₁tiger-right⟶_VNodePP(n=6, v=10.000)(depth=1)
│ ├─── ₀listen⟶_QNodePP(n=2, v=-1.000)
│ ├─── ₁open-left⟶_QNodePP(n=2, v=10.000)
│ └─── ₂open-right⟶_QNodePP(n=2, v=-45.000)
└─── ₂open-right⟶_QNodePP(n=4, v=-92.500)
vi_pruning output .alphas file:
2
-100.9500000000000028421709430 9.0500000000000007105427358
1
-16.0574999770099999807371205 6.9324999770100008689155402
1
-1.9500000000000001776356839 -1.9500000000000001776356839
1
6.9324999770100008689155402 -16.0574999770099999807371205
0
9.0500000000000007105427358 -100.9500000000000028421709430
-2.034 and -1.95 are close enough.
POUCT search tree (partial)
_VNodePP(n=99999, v=2.164)(depth=0)
├─── ₀listen⟶_QNodePP(n=99989, v=2.164)
│ ├─── ₀tiger-left⟶_VNodePP(n=49801, v=3.326)(depth=1)
│ │ ├─── ₀listen⟶_QNodePP(n=49781, v=3.326)
│ │ │ ├─── ₀tiger-left⟶_VNodePP(n=37173, v=6.550)(depth=2)
│ │ │ │ ├─── ₀listen⟶_QNodePP(n=374, v=-1.000)
│ │ │ │ ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│ │ │ │ └─── ₂open-right⟶_QNodePP(n=36796, v=6.550)
│ │ │ └─── ₁tiger-right⟶_VNodePP(n=12606, v=-1.000)(depth=2)
│ │ │ ├─── ₀listen⟶_QNodePP(n=12581, v=-1.000)
│ │ │ ├─── ₁open-left⟶_QNodePP(n=16, v=-38.125)
│ │ │ └─── ₂open-right⟶_QNodePP(n=9, v=-51.111)
│ │ ├─── ₁open-left⟶_QNodePP(n=3, v=-95.000)
│ │ └─── ₂open-right⟶_QNodePP(n=17, v=-38.235)
│ │ ├─── ₀tiger-left⟶_VNodePP(n=6, v=-1.000)(depth=2)
│ │ │ ├─── ₀listen⟶_QNodePP(n=3, v=-1.000)
│ │ │ ├─── ₁open-left⟶_QNodePP(n=2, v=-45.000)
│ │ └─── ₁tiger-right⟶_VNodePP(n=9, v=-1.000)(depth=2)
│ │ ├─── ₀listen⟶_QNodePP(n=7, v=-1.000)
vi_pruning output .alphas file:
2
-101.8525000000000062527760747 8.1475000000000008526512829
1
-28.3518061957880114221097756 7.2957562110763634066756822
1
-16.9599999770099998386285733 6.0299999770100010110240873
1
-4.8628187375409872572618042 4.3201187222526371556341473
1
2.3097999847116499338994799 2.3097999847116499338994799
1
4.3201187222526371556341473 -4.8628187375409872572618042
1
6.0299999770100010110240873 -16.9599999770099998386285733
1
7.2957562110763634066756822 -28.3518061957880114221097756
0
8.1475000000000008526512829 -101.8525000000000062527760747
2.164 and 2.309 are close enough.
POUCT search tree (partial)
_VNodePP(n=99999, v=1.469)(depth=0)
├─── ₀listen⟶_QNodePP(n=99992, v=1.469)
│ ├─── ₀tiger-left⟶_VNodePP(n=50056, v=2.662)(depth=1)
│ │ ├─── ₀listen⟶_QNodePP(n=50045, v=2.662)
│ │ │ ├─── ₀tiger-left⟶_VNodePP(n=37446, v=6.125)(depth=2)
│ │ │ │ ├─── ₀listen⟶_QNodePP(n=27829, v=6.125)
│ │ │ │ │ ├─── ₀tiger-left⟶_VNodePP(n=23068, v=9.442)(depth=3)
│ │ │ │ │ │ ├─── ₀listen⟶_QNodePP(n=191, v=-1.000)
│ │ │ │ │ │ ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│ │ │ │ │ │ └─── ₂open-right⟶_QNodePP(n=22874, v=9.442)
│ │ │ │ │ └─── ₁tiger-right⟶_VNodePP(n=4759, v=-1.000)(depth=3)
│ │ │ │ │ ├─── ₀listen⟶_QNodePP(n=4551, v=-1.000)
│ │ │ │ │ ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│ │ │ │ │ └─── ₂open-right⟶_QNodePP(n=205, v=-9.317)
ran with num_sims=200k instead of 100k:
_VNodePP(n=199999, v=1.572)(depth=0)
├─── ₀listen⟶_QNodePP(n=199992, v=1.572)
│ ├─── ₀tiger-left⟶_VNodePP(n=99693, v=2.711)(depth=1)
│ │ ├─── ₀listen⟶_QNodePP(n=99659, v=2.711)
│ │ │ ├─── ₀tiger-left⟶_VNodePP(n=74190, v=6.145)(depth=2)
│ │ │ │ ├─── ₀listen⟶_QNodePP(n=56780, v=6.145)
│ │ │ │ │ ├─── ₀tiger-left⟶_VNodePP(n=47076, v=9.371)(depth=3)
│ │ │ │ │ │ ├─── ₀listen⟶_QNodePP(n=218, v=-1.000)
│ │ │ │ │ │ ├─── ₁open-left⟶_QNodePP(n=3, v=-100.000)
│ │ │ │ │ │ └─── ₂open-right⟶_QNodePP(n=46855, v=9.371)
│ │ │ │ │ └─── ₁tiger-right⟶_VNodePP(n=9702, v=-1.000)(depth=3)
│ │ │ │ │ ├─── ₀listen⟶_QNodePP(n=9588, v=-1.000)
vi_pruning output .alphas file:
2
-97.8056900145239325183865731 12.1943099854760674816134269
1
-3.1749688868496956928311192 5.2204696298701351864224307
1
0.0716958745996567614611195 3.1762733082314071886287365
1
1.7955441981194137923694143 1.7955441981194137923694143
1
3.1762733082314071886287365 0.0716958745996567614611195
1
5.2204696298701351864224307 -3.1749688868496956928311192
0
12.1943099854760674816134269 -97.8056900145239325183865731
Though 1.469 seems slightly off from 1.795, increasing num_sims brought the estimate closer at 1.572.
The estimates are generally close, indicating POUCT in pomdp-py is doing a reasonable job approximating the optimal value in this domain. Though, as horizon increases, the estimate becomes poorer under the same number of simuations.