🧠 Intuition 🧠



How do you open a door in the dark? 🌃

If I tell you where to grasp and whether the door opens to the left or to the right, how would you proceed?

Yes, you can grasp the handle easily (since I told you where it is) and open the door by slowly adjusting your actions according to the door's movement! So we believe that for grasping, vision 👁️ is crucial, but for opening, it might not be the most important input!


What is our approach?

We train an RL policy that learns door motion from historical observations and adaptively adjusts its actions. We use a closed-loop system equipped with 💫Impedance Control💫 for better sim-to-real transfer and achieve an 84% success rate for opening doors in the real world!

🏂 Quick Summary 🏂

🔥 Our pipeline 🔥


Why is articulated object manipulation hard?

  • To safely open a door, you need to understand its articulation motion. This depends on the object's weight, friction, pivot radius, and more, which we only know after interacting with the object.
  • Therefore, using an open-loop pipeline or predicting action waypoints beforehand is not enough.
But can we get these values from vision?
  • OK, let's say we have constant visual feedback during manipulation; will it help? You can now be fairly sure that you grasped the door handle or even opened it a bit. But the large vision sim-to-real gap makes it hard to generalize.
  • One more thing: the handle might be occluded while opening, meaning that you have to find a good viewpoint to really capture what you need. Again, not really generalizable.



Here we take a shortcut to learn object motion without vision during the opening stage. Let's say we have a reliable grasping pose, so we can ensure that the grasping stage will succeed. Now we can treat the robot gripper as rigidly attached to the door handle, meaning that we can infer the handle position (as well as its motion) from the end-effector position. We compare the resulting consecutive positions of the handle with the actions we applied to understand the articulation motion.
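As a minimal sketch of this inference (the fixed grasp transform and function names are illustrative assumptions, not our actual code), the handle trajectory can be recovered from end-effector poses under the rigid-grasp assumption:

```python
import numpy as np

# Assumed fixed transform from the end-effector frame to the handle frame,
# valid once the grasp is established (rigid-grasp assumption).
T_EE_TO_HANDLE = np.eye(4)
T_EE_TO_HANDLE[:3, 3] = np.array([0.0, 0.0, 0.10])  # e.g. 10 cm along gripper z

def handle_position(ee_pose: np.ndarray) -> np.ndarray:
    """Infer the 3D handle position from a 4x4 end-effector pose."""
    return (ee_pose @ T_EE_TO_HANDLE)[:3, 3]

def handle_motion(ee_pose_history: list) -> np.ndarray:
    """Finite-difference handle displacements over a pose history; comparing
    these with the commanded actions reveals the articulation motion."""
    positions = np.stack([handle_position(p) for p in ee_pose_history])
    return np.diff(positions, axis=0)
```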

This is the idea behind the Adaptation Module and the Privileged Observation Encoder, which are also used in many locomotion pipelines.
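For reference, here is a minimal RMA-style sketch (layer sizes, dimensions, and names are assumptions): a privileged encoder compresses simulator-only states into a latent during training, and an adaptation module learns to regress the same latent from observation history, so no privileged state is needed at deployment.

```python
import torch
import torch.nn as nn

class PrivilegedEncoder(nn.Module):
    """Encodes simulator-only states (e.g. door mass, friction, pivot radius)
    into a compact latent, available only during training."""
    def __init__(self, priv_dim: int, latent_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(priv_dim, 64), nn.ELU(),
                                 nn.Linear(64, latent_dim))

    def forward(self, priv_obs):
        return self.net(priv_obs)

class AdaptationModule(nn.Module):
    """Regresses the same latent from a history of proprioceptive
    observations and actions, usable in the real world."""
    def __init__(self, obs_dim: int, history_len: int, latent_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim * history_len, 128),
                                 nn.ELU(), nn.Linear(128, latent_dim))

    def forward(self, obs_history):  # (batch, history_len, obs_dim)
        return self.net(obs_history.flatten(1))

# Phase-2 training supervises the adaptation module on the frozen encoder:
# loss = F.mse_loss(adaptation(obs_history), encoder(priv_obs).detach())
```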

🦾 Train in Simulation 🦾


But how do you use vision in the simulation? Do you distill it?

Our training in simulation is entirely "state-based". In the real world, the key value we are missing is the handle position. But in simulation, we can directly query it anytime, anywhere. In the real world, we infer this handle position from the end-effector pose, meaning that we can skip vision for good.
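A sketch of this swap (the simulator accessor and the grasp transform are hypothetical): the policy observation is assembled identically in both domains, and only the source of the handle position changes.

```python
import numpy as np

T_EE_TO_HANDLE = np.eye(4)  # assumed fixed grasp transform (see sketch above)

def build_observation(proprio: np.ndarray, ee_pose: np.ndarray,
                      sim_env=None) -> np.ndarray:
    """Assemble the policy observation the same way in sim and real."""
    if sim_env is not None:
        # Simulation: query the ground-truth handle position directly
        # (hypothetical simulator accessor).
        handle_pos = sim_env.get_handle_position()
    else:
        # Real world: infer it from the end-effector pose via the
        # rigid-grasp assumption.
        handle_pos = (ee_pose @ T_EE_TO_HANDLE)[:3, 3]
    return np.concatenate([proprio, handle_pos])
```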

Thanks to this, we learn a smooth and continuous action motion that follows the true motion of the object with 💫Impedance Control💫. Its high tolerance helps us adaptively learn the object's motion and adjust our actions.
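For intuition, here is a minimal task-space impedance law sketch (gains and names are illustrative; a real controller also needs gravity and Coriolis compensation): the commanded force pulls the end-effector toward the policy's target with spring-damper compliance, so small errors against the door's constrained motion are absorbed instead of fought.

```python
import numpy as np

# Assumed diagonal task-space stiffness and damping gains (illustrative values).
K = np.diag([300.0, 300.0, 300.0])  # stiffness, N/m
D = np.diag([30.0, 30.0, 30.0])     # damping, N·s/m

def impedance_force(x, x_des, xdot, xdot_des=np.zeros(3)):
    """Spring-damper force toward the policy's target position x_des."""
    return K @ (x_des - x) + D @ (xdot_des - xdot)

def joint_torques(jacobian, force):
    """Map the task-space force to joint torques: tau = J^T f."""
    return jacobian.T @ force
```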

How's your training different from other state-based policies?

We care about realistic embodiments. We don't use a flying gripper or a movable Franka, and we constrain the controller parameters so that the robot-object interaction stays close to what would happen in the real world. Most of the time, the Franka's movement in simulation is relaxed too much, allowing the policy to learn dangerous or even impossible configurations. We also set a reasonable workspace so that the policy can transfer directly from sim to real.

But these constraints inevitably make training harder. We design a stage-aware reward system that makes our motion continuous and smooth, following the reach-grasp-open order.
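A sketch of a stage-aware reward in this spirit (the stage conditions, thresholds, and weights are assumptions, not our exact reward): each stage gates its own shaping term, so the policy is pushed through reach, grasp, and open in order while keeping actions smooth.

```python
import numpy as np

def stage_reward(ee_pos, handle_pos, grasped: bool, door_angle: float,
                 action: np.ndarray) -> float:
    """Illustrative reach-grasp-open reward with an action-smoothness penalty."""
    smooth_penalty = 0.01 * float(np.square(action).sum())
    dist = np.linalg.norm(ee_pos - handle_pos)
    if not grasped and dist > 0.02:
        reward = -dist             # stage 1: reach toward the handle
    elif not grasped:
        reward = 1.0               # stage 2: bonus for closing on the handle
    else:
        reward = 2.0 + door_angle  # stage 3: reward door-opening progress
    return reward - smooth_penalty
```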

🌏 Qualitative Results 🌏


We tested our policy on a wide range of unseen objects, varying in size, appearance, hinge side (left or right), and stiffness, across a large workspace.


Our policy generates smooth and continuous motion, showing high tolerance to the object's motion. This video is captured at 1× speed.


As we use vision only at the first frame, our policy is very robust. We capture one RGBD image, infer a grasping pose, and then directly roll out our policy. It can reach human speed without the significant lagging phases seen with waypoint-based methods.
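As a sketch of this deployment loop (every module name here is a placeholder, not our released API): vision is used exactly once to obtain the grasp pose, after which the closed-loop policy runs on proprioception alone.

```python
def deploy(camera, grasp_predictor, policy, robot, steps: int = 200):
    """One RGBD frame -> grasp pose -> vision-free closed-loop rollout."""
    rgbd = camera.capture()                # the single RGBD observation
    grasp_pose = grasp_predictor(rgbd)     # grasping module (placeholder)
    robot.move_to(grasp_pose)
    robot.close_gripper()
    obs_history = []
    for _ in range(steps):
        obs = robot.proprioception()       # EE pose, joint states, etc.
        obs_history.append(obs)
        action = policy(obs, obs_history)  # adapts from history, no vision
        robot.apply_impedance_target(action)
```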

Some more Rollouts