Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

1University of Technology Sydney     2York University     3University of New South Wales     4Macquarie University

AAAI 2026

Abstract

3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data.

To address this, we propose Uni-Adapter, a training-free test-time adaptation method for 3D VLFMs. It maintains a dynamic prototype cache that captures intra-class variability, applies graph-based label smoothing to enforce consistency among similar prototypes, and fuses predictions from both the VLFM and the cache using entropy-weighted aggregation. Without any retraining, Uni-Adapter significantly improves robustness, boosting performance by 10.55% on ModelNet40-C, 8.26% on ScanObjectNN-C, and 4.49% on ShapeNet-C.
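To make the graph-based label smoothing step more concrete, the snippet below sketches one way it could look in PyTorch: pseudo-labels of cached prototypes are propagated over a k-nearest-neighbor similarity graph so that similar prototypes receive consistent soft labels. The function name, the single-step propagation rule, and the alpha/knn parameters are illustrative assumptions, not the exact formulation used in the paper.

import torch
import torch.nn.functional as F

@torch.no_grad()
def graph_label_smoothing(protos, labels, num_classes, alpha=0.5, knn=5):
    """Propagate prototype pseudo-labels over a k-NN similarity graph.

    protos: (N, D) prototype features; labels: (N,) pseudo-labels.
    Returns (N, C) smoothed label distributions. Illustrative sketch only.
    """
    protos = F.normalize(protos, dim=-1)
    sim = protos @ protos.t()                    # (N, N) cosine similarities
    sim.fill_diagonal_(-float("inf"))            # exclude self-edges
    topk = sim.topk(min(knn, protos.size(0) - 1), dim=-1)
    # Keep only each prototype's k nearest neighbours as graph edges.
    adj = torch.zeros_like(sim).scatter_(-1, topk.indices, topk.values.clamp(min=0))
    adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-8)   # row-normalize
    one_hot = F.one_hot(labels.long(), num_classes).float()
    # One mixing step: blend each prototype's own label with its neighbours'.
    return alpha * one_hot + (1 - alpha) * (adj @ one_hot)

A single propagation step like this already nudges the labels of neighboring prototypes toward agreement, which is the consistency effect described above.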

Figure 1: Visualization of intra-class variations across different object categories using t-SNE: Bathtub, Bench, Chair, Flowerpot, Person, and Cup.

Main idea

t-SNE Visualization of Uni3D Embeddings and Prototype Caching Strategies

Figure 2. (a) t-SNE of Uni3D embeddings for the airplane class in ModelNet40-C shows clear intra-class clustering patterns. Confidence-based prototypes (triangles) cache only high-confidence samples, while cluster-based prototypes (circles) represent distribution modes via online clustering. (b) In the toy example, confidence-based caching leads to incorrect boundaries due to poor mode coverage, whereas cluster-based caching captures diverse patterns and enables correct predictions.
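The contrast in Figure 2 can be summarized with a small sketch of a cluster-based prototype cache: instead of retaining only the most confident samples per class, each class keeps a handful of cluster centers updated online as running means. The class below, its merge threshold, and the per-class budget are illustrative assumptions, not the paper's exact online clustering procedure.

import torch
import torch.nn.functional as F

class ClusterPrototypeCache:
    """Per-class online prototype cache covering several intra-class modes."""

    def __init__(self, num_classes, max_protos=3, sim_thresh=0.8):
        # Each class keeps a small list of (center, count) pairs.
        self.protos = {c: [] for c in range(num_classes)}
        self.max_protos = max_protos
        self.sim_thresh = sim_thresh

    @torch.no_grad()
    def update(self, feat, pseudo_label):
        """Merge a D-dim feature into the nearest prototype of its pseudo-label,
        or open a new prototype when it is dissimilar to all existing ones and
        the per-class budget is not yet exhausted."""
        feat = F.normalize(feat, dim=-1)
        bank = self.protos[pseudo_label]
        if bank:
            sims = torch.stack([torch.dot(feat, center) for center, _ in bank])
            j = int(sims.argmax())
            if sims[j] > self.sim_thresh or len(bank) >= self.max_protos:
                center, n = bank[j]
                center = F.normalize((center * n + feat) / (n + 1), dim=-1)
                bank[j] = (center, n + 1)   # running-mean update of the cluster center
                return
        bank.append((feat, 1))              # start a new cluster for this mode

    @torch.no_grad()
    def logits(self, feat, temperature=0.07):
        """Cache logits: best cosine similarity between the query feature and any
        prototype of each class (a low score if a class has no prototype yet)."""
        feat = F.normalize(feat, dim=-1)
        scores = [max((torch.dot(feat, p).item() for p, _ in bank), default=-1.0)
                  for bank in (self.protos[c] for c in sorted(self.protos))]
        return torch.tensor(scores) / temperature

Keeping several centers per class is what allows the cache to cover the minority modes that a purely confidence-based cache misses in the toy example.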

Overview

Overview of Training-free Dynamic Adapter (Uni-Adapter)

Figure 3. Method Overview. Given a test point cloud \( \mathbf{X}_t \in \mathbb{R}^{L \times 3} \), our method extracts a point cloud feature \( \mathbf{f}_t \) via a point cloud encoder. The 3D cache is updated through online Prototyping, where cluster centers serve as 3D prototypes. The Prototype Reassignment module refines these prototypes, and their affinity with \( \mathbf{f}_t \) produces \( \mathbf{s}^{\text{cache}} \). Finally, the prediction logit \( \mathbf{s}^{\text{final}} \) is obtained by fusing \( \mathbf{s}^{\text{cache}} \) with the model’s base output \( \mathbf{s}^{\text{main}} \) using entropy-driven confidence weighting.
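As a rough illustration of the last step in Figure 3, the sketch below fuses the backbone logits \( \mathbf{s}^{\text{main}} \) with the cache logits \( \mathbf{s}^{\text{cache}} \) by weighting each branch with its inverse prediction entropy, so the more confident branch dominates. The specific weighting function is an assumption for illustration; the entropy-driven weighting in the paper may differ in detail.

import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_weighted_fusion(s_main, s_cache, eps=1e-8):
    """Fuse backbone logits (s_main) with cache logits (s_cache) using
    inverse-entropy confidence weights: the branch whose prediction is more
    peaked (lower entropy) receives the larger weight. Illustrative sketch."""
    def entropy(logits):
        p = F.softmax(logits, dim=-1)
        return -(p * torch.log(p + eps)).sum(dim=-1, keepdim=True)

    w_main = 1.0 / (entropy(s_main) + eps)    # confident branch -> low entropy -> big weight
    w_cache = 1.0 / (entropy(s_cache) + eps)
    z = w_main + w_cache                      # normalize so the weights sum to 1
    return (w_main / z) * s_main + (w_cache / z) * s_cache

With this kind of fusion, a clean sample where the backbone is already confident is barely affected by the cache, while a corrupted sample with a flat backbone prediction leans on the cached prototypes.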



3D VLFM Throughput Comparison

We compare the throughput of Uni-Adapter against zero-shot inference and cache-based TTA approaches across three 3D VLFM backbones (Uni3D, OpenShape, and ULIP-2) on ModelNet40-C. All results are obtained with batch size = 1 on an RTX 4090 GPU.

Throughput (t/s) comparison of 3D VLFMs and cache baselines on ModelNet40-C
Method Uni3D OpenShape ULIP-2
Zero-shot 39.19 15.90 23.94
TDA 36.02 14.43 21.78
Point-Cache 9.73 9.74 11.11
Uni-Adapter (Ours) 36.93 15.06 22.67
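For reference, the timing protocol behind such numbers can be approximated with a simple harness like the one below, which feeds single samples to a model and reports samples processed per second. The model and dataset objects are placeholders for a 3D VLFM pipeline and a corrupted test set; the harness is a sketch of the measurement setup (batch size 1, GPU timing with synchronization), not the exact benchmarking code.

import time
import torch

@torch.no_grad()
def measure_throughput(model, dataset, device="cuda", warmup=20):
    """Rough throughput (samples/s) at batch size 1. `model` stands for a full
    zero-shot or test-time-adaptation pipeline; `dataset` yields point clouds."""
    model.eval().to(device)
    # Warm-up iterations so GPU clocks and caches stabilize before timing.
    for i, x in enumerate(dataset):
        model(x.unsqueeze(0).to(device))
        if i + 1 >= warmup:
            break
    torch.cuda.synchronize()
    start, n = time.perf_counter(), 0
    for x in dataset:
        model(x.unsqueeze(0).to(device))
        n += 1
    torch.cuda.synchronize()
    return n / (time.perf_counter() - start)

The explicit torch.cuda.synchronize() calls matter because GPU kernels run asynchronously; without them the measured time would exclude in-flight work.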

ModelNet40-C Results

Tab 1. Top-1 accuracy (%) on ModelNet40-C under distribution shifts using Uni3D-Large (batch size = 1).
Source-Only shows performance without adaptation. Best and second-best are in bold and underline.
* denotes VLFM-based TTA methods.
Method uni gau bac imp ups rbf rbi ded dei she rot cut dis ocl lid Avg.
Source-Only 57.37 54.01 70.21 61.91 60.69 51.74 52.39 67.50 74.87 72.40 71.02 63.97 58.95 47.24 22.93 59.15
TENT (ICLR21) 61.46 58.30 65.28 56.36 65.08 51.62 52.51 65.44 75.24 72.36 72.44 61.43 58.18 48.09 28.44 59.48
SHOT (ICML20) 61.50 58.42 65.27 56.68 64.66 51.54 52.84 65.48 75.57 72.81 72.29 61.14 58.14 48.46 28.44 59.55
SAR (ICLR23) 61.51 58.18 65.40 56.77 64.79 51.86 53.40 66.37 75.93 72.49 72.08 62.28 59.12 49.31 30.71 60.01
DUA (CVPR22) 61.26 57.90 65.11 56.76 64.79 51.38 53.08 65.44 75.16 72.93 72.41 60.85 58.35 48.50 28.65 59.50
MEMO (NIPS22) 57.38 54.08 70.24 61.92 60.71 51.76 52.39 67.59 74.88 72.48 71.03 63.98 59.30 47.36 22.94 59.20
TPT* (NIPS22) 60.65 57.20 76.32 61.15 63.47 55.39 55.78 71.36 74.77 75.20 73.01 65.41 60.97 47.60 18.10 61.02
LAME (CVPR22) 57.33 54.09 70.10 61.63 60.94 51.94 52.27 67.50 75.16 72.45 71.27 63.90 59.04 47.33 22.61 59.17
3DD-TTA (WACV25) 60.53 59.76 63.53 64.71 67.91 50.73 51.09 59.36 66.98 64.26 58.67 59.36 53.77 36.10 32.37 56.06
CloudFixer † (ECCV24) 65.44 65.84 63.90 68.11 72.77 52.80 53.81 52.88 63.21 59.36 61.18 56.16 52.18 29.09 24.60 56.09
T3A (NIPS21) 69.40 70.26 41.89 63.33 70.74 61.26 59.27 72.20 79.33 77.59 78.36 71.07 65.51 49.59 32.01 64.12
TDA* (CVPR24) 62.20 61.63 75.16 65.52 67.18 57.46 57.94 70.95 76.94 74.43 73.26 67.46 63.17 50.69 30.39 63.63
Point-Cache* (CVPR25) 64.34 64.87 73.95 68.31 71.68 62.84 65.19 73.22 77.80 77.15 75.77 69.77 68.31 54.78 32.98 66.73
Uni-Adapter (Ours) 66.82 65.52 78.32 72.25 72.04 65.60 66.61 77.51 80.63 79.05 79.30 75.29 73.38 56.92 36.26 69.70

ShapeNet-C Results

Tab 2. Top-1 accuracy (%) on ShapeNet-C under distribution shifts using Uni3D-Large (batch size = 1).
Source-Only shows performance without adaptation. Best and second-best are in bold and underline.
* denotes VLFM-based TTA methods.
Method uni gau bac imp ups rbf rbi ded dei she rot cut dis ocl lid Avg.
Source-Only 60.33 55.75 65.95 65.02 59.04 59.41 60.23 79.06 71.07 75.62 73.87 76.82 63.22 2.37 1.06 57.92
TENT (ICLR21) 59.88 54.30 58.31 61.14 57.78 58.69 60.17 79.30 72.69 75.83 74.54 77.08 63.23 2.97 2.54 57.23
SHOT (ICML20) 59.96 54.32 58.37 61.14 57.67 58.81 60.19 79.23 72.53 75.90 74.57 77.09 63.34 3.02 2.51 57.24
SAR (ICLR23) 59.86 54.09 58.59 60.84 57.48 58.65 60.09 79.04 73.23 74.78 71.39 77.04 61.88 2.69 1.40 56.74
DUA (CVPR22) 59.85 54.37 58.26 61.21 57.89 58.71 60.15 79.11 72.40 72.55 75.91 77.05 63.24 2.97 2.53 57.08
MEMO (NIPS22) 60.33 55.76 66.02 65.02 59.04 59.42 60.23 79.01 70.92 75.62 73.95 76.82 63.22 2.41 1.09 57.92
TPT* (NIPS22) 62.87 56.63 69.20 64.70 59.10 58.29 60.43 81.59 75.23 76.93 74.56 80.52 63.02 2.17 1.25 59.10
LAME (CVPR22) 60.43 55.89 66.04 65.12 59.09 59.42 60.33 79.13 71.16 75.74 74.08 76.99 63.35 2.31 1.02 58.01
3DD-TTA (WACV25) 65.78 64.16 55.00 61.75 68.05 55.06 56.20 74.20 67.35 68.06 61.19 73.01 56.29 2.50 0.97 55.30
CloudFixer † (ECCV24) 65.57 65.30 58.15 69.53 63.65 55.02 56.71 69.89 58.67 65.65 65.64 70.36 55.09 3.75 2.46 57.24
T3A (NIPS21) 60.60 53.12 20.70 44.19 49.31 46.35 44.37 70.31 63.01 64.86 63.83 68.24 52.60 1.00 0.87 46.89
TDA* (CVPR24) 62.75 58.95 68.33 67.14 62.09 61.28 62.56 79.00 71.44 75.93 74.79 77.17 64.44 3.82 1.81 59.43
Point-Cache* (CVPR25) 62.63 56.71 66.51 65.85 61.15 59.79 61.49 75.89 69.47 72.61 70.81 73.82 63.41 3.64 1.67 57.70
Uni-Adapter (Ours) 66.89 62.23 71.38 68.62 64.15 67.42 67.33 80.76 75.69 78.11 77.01 79.64 70.14 4.36 2.47 62.41

ScanObjectNN-C Results

Tab 3. Top-1 accuracy (%) on ScanObjectNN-C using Uni3D-Large (batch size = 1).
* denotes VLFM-based TTA. Source-Only shows performance without adaptation.
Best and second-best are in bold and underline.
Method uni gau bac imp ups rbf rbi ded dei she rot cut dis ocl lid Avg.
Source-Only 29.78 25.99 40.62 45.96 30.64 33.05 34.42 56.28 47.16 54.22 54.04 56.80 43.55 9.98 8.61 38.07
TENT (ICLR21) 29.78 26.16 49.91 51.46 30.65 33.22 36.32 55.08 45.27 52.50 53.18 55.77 44.92 7.06 3.79 38.34
SHOT (ICML20) 29.60 26.85 51.12 52.32 31.33 33.73 37.20 56.80 45.78 54.57 54.22 56.31 45.27 6.72 3.96 39.05
SAR (ICLR23) 29.08 27.54 42.17 44.58 31.33 32.36 34.77 56.28 44.41 52.84 54.22 55.59 44.06 9.64 9.64 37.90
DUA (CVPR22) 29.95 27.37 41.65 44.92 30.81 32.53 34.08 56.63 45.09 53.87 54.22 55.77 43.89 9.81 9.47 38.00
MEMO (NIPS22) 29.77 26.16 49.91 51.46 30.64 33.22 36.32 55.08 45.27 52.50 53.18 55.77 44.92 7.05 3.78 38.34
TPT* (NIPS22) 30.04 28.43 40.95 46.76 32.68 35.55 34.39 56.60 50.66 53.81 54.39 60.30 42.70 11.13 6.07 38.96
LAME (CVPR22) 29.60 26.85 51.12 52.32 31.33 33.73 37.18 56.80 45.78 54.56 54.22 56.29 45.27 6.71 3.96 39.05
3DD-TTA (WACV25) 32.19 30.81 27.71 39.59 34.60 25.82 26.51 45.61 38.04 36.14 33.39 45.09 33.05 8.61 4.47 30.78
CloudFixer † (ECCV24) 36.83 33.22 36.14 48.19 37.69 27.54 30.46 45.44 40.28 38.73 38.21 46.13 35.28 10.67 11.70 34.43
T3A (NIPS21) 34.94 32.70 34.08 43.37 33.22 36.49 36.32 62.65 55.42 61.46 62.31 64.03 56.97 9.47 8.61 42.14
TDA* (CVPR24) 31.33 28.40 52.67 53.36 32.36 36.32 40.45 56.80 46.82 55.25 55.76 57.66 49.40 8.09 4.65 40.62
Point-Cache* (CVPR25) 30.98 27.19 55.25 56.45 33.73 40.79 43.03 59.50 49.23 60.07 56.63 57.66 49.05 8.26 4.30 42.13
Uni-Adapter (Ours) 35.28 37.69 53.35 59.55 39.07 42.16 52.49 60.58 51.97 61.61 60.24 60.92 54.38 20.82 4.81 46.33

BibTeX


@misc{tamjidi2025adaptasyouwalkcloudstrainingfreeonline,
      title={Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models}, 
      author={Mehran Tamjidi and Hamidreza Dastmalchi and Mohammadreza Alimoradijazi and Ali Cheraghian and Aijun An and Morteza Saberi},
      year={2025},
      eprint={2511.15311},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15311}, 
}