For full setup instructions with pictures and videos, refer to the setup docs.
VLA-REPLICA utilizes a simple Python script for benchmarking as well as the LeRobot library for SO-101 control.
GPU VRAM usage is heavy during inference, especially for more complex VLAs like pi0.5, so a GPU with at least 24GB VRAM is recommended.
Clone the repository, create a new virtual environment (recommended) and install prequisites listed in the environments.yml file:
git clone https://github.com/IRVLUTD/VLAReplica.git
cd VLAReplica
conda env create -f environment.yml
conda activate vlareplica
Since the camera indices on every computer can vary, utilize leRobot's find-cameras command to list out the corresponding index numbers for the RealSense and Vinmooog cameras (run in terminal):
lerobot-find-cameras
Record the camera indices for two cameras.
Since the USB serial port on every computer can vary, utilize leRobot's find-port command to list out the corresponding serial port of the SO-101 follower arm. Run the following command in a terminal:
lerobot-find-port
and then unplug the SO-101 USB cable from the computer, and then press Enter.
The terminal will output something like: Device port: /dev/ttyACM1.
Record the serial port (e.g. /dev/ttyACM1) for the follower arm.
Calibration video from LeRobot:
Video 1: SO-101 Arm Calibration Procedure.-
Calibrate the SO-101 follower according to the LeRobot Docs. Follow the video carefully, and ensure each motor is at the middle position before starting the calibration process.
- (Note: This means for the wrist roll motor, the end-effector should be oriented so that the camera is rotated 90° and pointing towards the right side when looking at the end-effector head on)
- During calibration, thoroughly rotate each of the six motors to their physical joint limits. Don't forget any motors!
-
After calibration is complete, the
calibration.jsonfile is typically saved to~/.cache/huggingface/lerobot/calibration/robots/<your-robot-id>in your root folder. Copy the generated calibration JSON file intovlareplica/calibration/robots/so101_followerinside your repo directory. -
Rename it to
so101_follower_arm.json.
We first utilize an AprilTag mounted at a defined spot with respect to the box to allow general placement of the camera mount. Then, we utilize the idea of an image overlay to match the camera pose to the original VLA-Replica box camera pose as closely as possible.
-
In a new terminal inside the virtual environment, run the calibration script (replace your-top-camera-index with the number you recorded in Software Installation):
python calibration/camera/detect_apriltag.py --camera-index <your-top camera-index>A GUI window will pop up, displaying the live camera feed alongside the estimated AprilTag pose.
-
Reach inside the box and physically slide or tilt the camera mount along the PVC pipe until all reported values match the table below as close as possible (some error is acceptable):
X (m) Y (m) Z (m) R (deg) P (deg) Y (deg) -0.06 ± 0.01 -0.39 ± 0.01 1.25 ± 0.01 -18.5 ± 1.0 3.0 ± 1.0 2.5 ± 1.0 Once satisfied, press
qto exit the program.
Although the AprilTag pose estimator may output values close to Table A.2, there may still be slight camera misalignment. To solve this, we utilize visual overlay matching (see below) to ensure the camera view is as close as possible to VLA-REPLICA’s original view.
Video 2: Image Overlay Calibration Procedure.-
First, calibrate the top camera for the second time. Run the following (replacing
your-top-camera-idwith with the number you recorded in Software Installation):python calibration/camera/overlay.py --overlay-image-folder calibration/camera/referenceImages/top --base-cam <your-top-camera-id>A GUI window will pop up, overlaying the live top camera feed with a wrist view reference image. Match the view of your camera with the reference image by reaching into the box and sliding or tilting the camera mount along the PVC pipe.
-
Next, calibrate the wrist camera. Run the following (replacing
your-wrist-camera-idwith with the number you recorded in Software Installation):python calibration/camera/overlay.py --overlay-image-folder calibration/camera/referenceImages/wrist --base-cam <your-wrist-camera-id>A GUI window will pop up, overlaying the live wrist camera feed with a top view reference image. Slightly loosen the M3 screw on the wrist camera mount on the SO-101, and match the view of your camera with the reference image by rotating the camera mount along the end effector.
Before the next step, ensure that:
- All six pose values (x,y,z,R,P,Y) match the targets in the table:
- | X (m) | Y (m) | Z (m) | R (deg) | P (deg) | Y (deg) | | --- | --- | --- | --- | --- | --- | | -0.06 ± 0.01 | -0.39 ± 0.01 | 1.25 ± 0.01 | -18.5 ± 1.0 | 3.0 ± 1.0 | 2.5 ± 1.0 |
- Reference images and camera views match almost identically for both top and wrist cameras.
Congrats! The environment setup is complete, and you are ready to start benchmarking your VLA models!
Use the evaluation script benchmark.py to run a policy across predefined ID or OOD tasks, with predefined reference images. Refer to the table below for all CLI flags.
Currently, the script supports the following models: {act,smolvla,dit,xvla,pi0,pi05}. Support for other VLA models will arrive soon. Feel free to modify the script to implement other VLA models of your liking.
Inside your virtual environment, run:
python benchmark.py \
--policy-type pi0 \
--policy-path lerobot/pi0_base \
--policy-from-hub \
--run-all-tasks \
--task-subset ID \
--iterations 5 \
--eval-follower-calib-dirs calibration/robots/so101_follower \
--eval-follower-ports /dev/ttyACM1 \
--eval-follower-ids so101_follower_arm \
--eval-top-indexes 4 \
--eval-wrist-indexes 14 \
--reset-mode fixed \
--reset-action-file arm_reset.json
| Flag | Description |
|---|---|
--policy-type <model> |
Selects the policy family to evaluate. Currently supported models: {act,smolvla,dit,xvla,pi0,pi05} |
--policy-path <path> |
Hugging Face repo ID or local path for the policy checkpoint. |
--policy-from-hub |
If --policy-path directs to a Hugging Face repo ID, include this flag. Loads policy from Hugging Face Hub instead of local directory. |
--run-all-tasks |
Runs evaluation across all 10 VLA-REPLICA tasks from task config, instead of single task. |
--task-subset <ID or OOD> |
When using --run-all-tasks, restricts evaluation to ID or OOD task subset. |
--iterations <number> |
Number of evaluation iterations per task (we used 5 in the paper). |
--eval-follower-calib-dirs <path> |
Follower calibration directory. (default: calibration/robots/so101_follower). |
--eval-follower-ports <serial port> |
Serial port for the follower robot (e.g. dev/ttyACM1) |
--eval-follower-ids <id> |
Robot ID for the follower arm. (default: so101_follower_arm) |
--eval-top-indexes <index> |
Top-camera index for the active arm. |
--eval-wrist-indexes <index> |
Wrist-camera index for the active arm. |
--reset-mode fixed |
Uses a fixed reset action instead of teleoperated leader reset (we enabled this for the paper). |
--reset-action-file <path> |
JSON file containing the normalized reset action vector required when --reset-mode fixed is used. (default: arm_reset.json) |
-
After the script loads the corresponding policy and connects successfully to the followers, the follower arm will move to a consistent start position (predetermined in
arm_reset.json). An openCV GUI will pop up, overlaying the live video feed from the top camera with the proper test scene (i.e. predefined object placements) for that task. -
Grab the corresponding objects needed for that scene (i.e. red plate and bread A for the first task) and then move the objects to their reference image positions so that the live camera and overlay image are identical to each other.
benchmark.py live video evaluation GUI. The user is currently setting up the scene for the "Put bread on plate" task.
-
When the live video feed and overlay image match almost exactly, press
Enteron the keyboard to start policy inference.- During policy evaluations for the VLA-REPLICA paper, each policy is given 90 seconds to complete the task before the iteration ends.
- If the policy completes the task before 90 seconds, press
right arrow (➜)to skip to the setup phase of the next iteration. The SO-101 arm will reset back to the start position.
-
Log success and/or failure behavior for each iteration corresponding to that specific task. The full list of tasks and criteron are listed below.
- ID tasks use scene layouts close to the training distribution to see how well the model learns.
- There are 10 ID tasks total, with 5 variants each, for a total of 50 ID iterations.
- OOD tasks test new colors, counts, or objects to test how well the model generalizes generalization.
- There are 8 ID tasks total, with 5 variants each, for a total of 40 ID iterations.
The full list of tasks is located under Task Reference
| Task | Goal | Success condition |
|---|---|---|
| Put bread on plate | Place the correct bread on the correct colored plate | Bread is resting on the target plate and the arm returns home |
| Put bowl on coaster | Place the correct bowl on the correct coaster | Correct bowl is on correct coaster and the arm returns home |
| Stack blocks | Stack the target block on the target block | Top block remains in contact for more than 2 seconds |
| Fold towel | Fold the towel in half | Edges are lifted and folded by more than 50% |
| Open oven | Open the oven door | Door stays open for 2+ seconds |
| Clean whiteboard | Wipe the board with the eraser | Eraser wipes 2+ times and is placed next to the board |
| Pour pepper | Pour the required number of shakes | Correct number of shakes poured and object returned |
| Lift bowl | Lift the correct bowl the required number of times | Correct lifting count is completed |
| Press button | Press the button the required number of times | Correct number of presses completed |
| Collect blocks | Put all blocks into the correct box | All blocks are in the target box and the arm returns home |
For full setup instructions with pictures and videos, refer to the setup docs.

