Table 1: D4RL results with baselines re-run under the same goal-conditioned evaluation pipeline. LSDT and QT are re-run with goal concatenation (–goalconcate) under 5 seeds; DMixer results are from the original paper (†). Orange = best, underline = second best.

| Antmaze-v2 | LSDT (rerun) | DMixer† | QT (rerun) | QHYER |
|---|---|---|---|---|
| umaze | 76.4 ± 5.4 | 100.0 ± 0.5 | 90 ± 1 | 98.4 ± 1.9 |
| umaze-diverse | 76.8 ± 3.1 | 100.0 ± 0.5 | 92 ± 3 | 97.1 ± 2.3 |
| medium-play | 80.6 ± 14.4 | – | 41 ± 2 | 92.2 ± 3.5 |
| medium-diverse | 43.6 ± 20.7 | 60.0 ± 1.3 | 45 ± 7.1 | 94.0 ± 2.7 |
| large-play | 26.0 ± 15.5 | – | 40 ± 2 | 44.2 ± 1.9 |
| large-diverse | 37.0 ± 21.5 | – | 38 ± 4 | 57.5 ± 13.5 |
| Total | 340.4 | – | 346 | 483.4 |

| Maze2d | LSDT (rerun) | DMixer† | QT (rerun) | QHYER |
|---|---|---|---|---|
| umaze | 74.3 ± 1.6 | 86.9 ± 1.9 | 96.0 ± 5.0 | 118.5 ± 1.9 |
| medium | 125.5 ± 22.2 | 95.2 ± 7.7 | 159.9 ± 15.7 | 173.0 ± 11.9 |
| Total | 199.8 | 182.1 | 255.9 | 291.5 |

Table 2: Comparison with recent offline RL methods adapted to the offline GCRL setting on OGBench play datasets. All methods use HER for goal relabeling. Results are mean ± std (%) success rate over 4 seeds. Orange = best.

| Environment | GC-TrL | GC-SHARSA | GC-DEAS | GC-QCFQL | QHYER |
|---|---|---|---|---|---|
| cube-single-play | 6.3 ± 0.3 | 1.3 ± 1.2 | 20.4 ± 3.2 | 15.8 ± 2.9 | 84 ± 4 |
| cube-double-play | 1.1 ± 0.2 | 0.1 ± 0.2 | 5.3 ± 2.7 | 5.4 ± 0.4 | 56 ± 2 |
| cube-triple-play | 0.7 ± 0.3 | 0.0 ± 0.0 | 5.0 ± 2.3 | 5.6 ± 2.2 | 10 ± 5 |
| cube-quadruple-play | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 2 ± 1 |
| scene-play | 4.2 ± 1.6 | 2.8 ± 1.0 | 16.0 ± 2.3 | 19.2 ± 4.5 | 53 ± 2 |
| puzzle-3x3-play | 1.8 ± 0.8 | 4.8 ± 4.0 | 3.0 ± 1.0 | 2.1 ± 1.5 | 92 ± 2 |
| puzzle-4x4-play | 0.4 ± 0.2 | 0.6 ± 0.7 | 1.7 ± 0.6 | 2.7 ± 2.8 | 28 ± 5 |
| puzzle-4x5-play | 0.1 ± 0.2 | 0.1 ± 0.2 | 0.0 ± 0.0 | 0.3 ± 0.3 | 31 ± 1 |
| puzzle-4x6-play | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 18 ± 2 |
| Average | 1.6 | 1.1 | 5.7 | 5.7 | 41.6 |

Table 3: OGBench navigation and visual manipulation benchmarks. Comparison with Graph-based Stitching (GAS, ICML 2025). Results are mean ± std (%) over 5 seeds. Orange = best.
| Environment | GAS | QHYER |
|---|---|---|
| antmaze-giant-stitch (navigation) | 88 ± 4 | 70 ± 2 |
| visual-scene-play (manipulation) | 54 ± 6 | 96 ± 1 |

Table 4: Ablation study on architectural variants. All variants use the same NFs-based Q-conditioning. Attention-only: convdim=0; Mamba-only: convdim=h_dim; Hybrid: learned gating (default QHYER). Results are mean ± std (%) over 4 seeds. Orange = best, underline = second best.

| Environment | Attention-only | Mamba-only | Hybrid (QHYER) |
|---|---|---|---|
| cube-single-play (non-Markovian) | 74 ± 1 | 80 ± 2 | 84 ± 4 |
| cube-single-noisy (Markovian) | 60 ± 3 | 91 ± 3 | 95 ± 5 |

Figure 1: Content-adaptive ∆_t visualization for Mamba's selective SSM in QHYER (cube-single). Left: distribution of ∆_t values; play data yields systematically smaller ∆_t (slower forgetting, longer effective memory) than noisy data (larger ∆_t, faster forgetting), with a 2.8× ratio between the mean ∆_t of the two distributions. Center: mean ∆_t as a function of sequence position. Right: learned gate weights between the attention and Mamba branches across dataset types.

Table 5: Mamba ∆_t and gate-weight statistics on cube-single, extracted from the trained QHYER model (50 batches, batch size 256). Recall Eq. (13): Ā_t = exp(∆_t · A) with A < 0. Smaller ∆_t ⇒ Ā_t ≈ 1 ⇒ longer effective memory (history preservation); larger ∆_t ⇒ Ā_t ≈ 0 ⇒ shorter effective memory (local focus).
Effective memory length is defined as the number of past steps k such that ∏_{j=1}^{k} Ā_{t−j+1} > 0.5.

| Metric | play (non-Markovian) | noisy (Markovian) |
|---|---|---|
| Mean ∆_t | 0.38 | 1.05 |
| Std ∆_t | 0.12 | 0.31 |
| Mean Ā_t = exp(∆_t · A) | 0.92 | 0.61 |
| Effective memory length (steps) | ∼12 | ∼3 |
| Mean gate weight (attention branch) | 0.57 | 0.42 |
| Mean gate weight (Mamba branch) | 0.43 | 0.58 |
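The effective-memory-length definition above can be sketched numerically. This is a minimal illustration, not the paper's code: it assumes a single scalar decay parameter A = −0.15 and a constant ∆_t over a 32-step window, whereas the trained model uses per-channel A values and content-dependent ∆_t.

```python
import numpy as np

def effective_memory_length(delta_t, A=-0.15, threshold=0.5):
    """Largest k with prod_{j=1..k} A_bar_{t-j+1} > threshold,
    where A_bar_t = exp(delta_t * A) and A < 0 (Eq. 13)."""
    a_bar = np.exp(np.asarray(delta_t) * A)   # per-step decay factors in (0, 1)
    prod = np.cumprod(a_bar[::-1])            # cumulative product over the k most recent steps
    return int(np.sum(prod > threshold))

# Mean delta_t values from Table 5, held constant for illustration
play_len = effective_memory_length(np.full(32, 0.38))   # smaller delta_t -> slower forgetting
noisy_len = effective_memory_length(np.full(32, 1.05))  # larger delta_t -> faster forgetting
assert play_len > noisy_len
```

With this illustrative A, the play-like ∆_t retains roughly three times as many past steps as the noisy-like ∆_t, mirroring the ∼12 vs. ∼3 gap reported in Table 5.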
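Table 4's hybrid variant blends the attention and Mamba branches with a learned gate. One plausible per-token form is sketched below; the parameter names (`w_g`, `b_g`) and the concatenate-then-sigmoid parameterization are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def hybrid_gate(attn_out, mamba_out, w_g, b_g):
    """Blend two branch outputs with a learned sigmoid gate.
    attn_out, mamba_out: (tokens, h_dim); w_g: (2 * h_dim, 1); b_g: scalar."""
    x = np.concatenate([attn_out, mamba_out], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(x @ w_g + b_g)))  # gate in (0, 1), shape (tokens, 1)
    return g * attn_out + (1.0 - g) * mamba_out

# With zero-initialized gate parameters the gate sits at 0.5: an even blend
out = hybrid_gate(np.zeros((4, 8)), np.ones((4, 8)), np.zeros((16, 1)), 0.0)
assert np.allclose(out, 0.5)
```

Because the gate is computed per token, the mixture can shift with the data, consistent with Table 5, where the Mamba branch receives more weight on the noisy (Markovian) dataset than on play.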