New deep learning technique paves path to pizza-making robots

That’s amore

Handling deformable objects with reinforcement learning and deep learning

If an AI system wants to handle an object, it has to be able to detect and define its state and predict how it will look in the future. This is a problem that has been largely solved for rigid objects. With a good set of training examples, adeep neural networkwill be able to detect a rigid object from different angles. However, when it comes to deformable objects, the space of possible states becomes much more complicated.

“For rigid objects, we can describe its state with six numbers: Three numbers for its XYZ coordinates and another three numbers for its orientation,” Xingyu Lin, Ph.D. student at CMU and lead author of the DiffSkill paper, told TechTalks.

“However, deformable bodies, such as the dough or fabrics, have infinite degrees of freedom, making it much more difficult to describe their states precisely. Furthermore, the ways they deform are also harder to model in a mathematical way compared to rigid bodies.”

The development of differentiable physics simulators enabled the application of gradient-based methods to solve deformable object manipulation tasks. This is in contrast to the traditionalreinforcement learningapproach that tries to learn the dynamics of the environment and objects through pure trial-and-error interactions.

DiffSkill was inspired byPlasticineLab, a differentiable physics simulator that was presented at the ICLR conference in 2021. PlasticineLab showed that differentiable simulators can help short-horizon tasks.

But differentiable simulators still struggle with long-horizon problems that require multiple steps and the use of different tools. AI systems based on differentiable simulators also require the agent to know the full simulation state and relevant physical parameters of the environment. This is especially limiting for real-world applications, where the agent usually perceives the world through visual and depth sensory data (RGB-D).

“We started to ask if we can extract [the steps required to accomplish a task] as skills and also learn abstract notions about the skills so that we can chain them to solve more complex tasks,” Lin said.

DiffSkill is a framework where the AI agent learns skill abstraction using the differentiable physics model and composes them to accomplish complicated manipulation tasks.

Lin’s past work was focused on using reinforcement learning for the manipulation of deformable objects such as cloth, ropes, and liquids. For DiffSkill, he chose dough manipulation because of the challenges it poses.

“Dough manipulation is particularly interesting because it cannot be easily performed with the robot gripper, but requires using different tools sequentially, something humans are good at but is not very common for robots to do,” Lin said.

Once trained, DiffSkill can successfully accomplish a set of dough manipulation tasks using only RGB-D input.

Learning abstract skills with neural networks

DiffSkill is composed of two key components: a “neural skill abstractor” that usesneural networksto learn individual skills and a “planner” that composes the skill to solve long-horizon tasks.

DiffSkill uses a differentiable physics simulator to generate training examples for the skill abstractor. These samples show how to achieve a short-horizon goal with a single tool, such as using a roller to spread the dough or a spatula to displace the dough.

These examples are presented to the skill abstractor as RGB-D videos. Given an image observation, the skill abstractor must predict whether the desired goal is feasible or not. The model learns and tunes its parameters by comparing its prediction with the actual outcome of the physics simulator.

At the same time, DiffSkill trains a variational autoencoder (VAE) to learn a latent-space representation of the examples generated by the physics simulator. The VAE encodes images in a lower-dimension space that preserves important features and discards information that is not relevant to the task. By transferring the high-dimensional image space into the latent space, the VAE plays an important role in enabling DiffSkill to plan over long horizons and predict outcomes by observing sensory data.

One of the important challenges of training the VAE is making sure it learns the right features and generalizes to the real world, where the composition of visual data is different from those generated by the physics simulator. For example, the color of the roller pin or the table is not relevant to the task, but the position and angle of the roller and the location of the dough are.

Currently, the researchers are using a technique called “domain randomization,” which randomizes the irrelevant properties of the training environment such as background and lighting, and keeps the important features such as the position and orientation of tools. This makes the VAE more stable when applied to the real world.

“Doing this is not easy, as we need to cover all possible variations that are different between the simulation and the real world [known as the sim2real gap],” Lin said. “A better way is to use a 3D point cloud as representation of the scene, which is much easier to transfer from simulation to the real world. In fact, we are working on a follow-up project using point cloud as input.”

Planning long-horizon deformable object tasks

Once the skill abstractor is trained, DiffSkill uses the planner module to solve long-horizon tasks. The planner must determine the number and sequence of skills needed to go from the initial state to the destination.

This planner iterates over possible combinations of skills and the intermediate outcomes they yield. The variational autoencoder comes in handy here. Instead of predicting full image outcomes, DiffSkill uses the VAE to predict the latent-space outcome of intermediate steps toward the final goal.

The combination of abstract skills and latent-space representations makes it much more computationally efficient to draw a trajectory from the initial state to the goal. In fact, the researchers didn’t need to optimize the search function and used an exhaustive search of all combinations.

“The computation is not too much since we are planning over the skills and the horizon is not very long,” Lin said. “This exhaustive search eliminates the need for designing a sketch for the planner and might lead to novel solutions not considered by the designer in a more general way, although we did not observe this in the limited tasks we tried. Furthermore, more sophisticated search techniques could be applied as well”

According to the DiffSkill paper, “optimization can be done efficiently in around 10 seconds for each skill combination on a single NVIDIA 2080Ti GPU.”

Preparing the pizza dough with DiffSkill

The researchers tested the performance of DiffSkill against several baseline methods that have been applied to deformable objects, including two model-free reinforcement learning algorithms and a trajectory optimizer that only uses the physics simulator.

The models were tested on several tasks that require multiple steps and tools. For example, in one of the tasks, the AI agent must lift the dough with a spatula, place it on a cutting board, and spread it with a roller.

The results show that DiffSkill is significantly better than other techniques at solving long-horizon, multiple-tool tasks using only sensory information. The experiments show that when well trained, DiffSkill’s planner can find good intermediate states between the initial and goal states and find decent sequences of skills to solve tasks.

“One takeaway is that a set of skills can provide very important temporal abstraction, allowing us to reason over long-horizon,” Lin said. “This is also similar to how human approaches different tasks: thinking at different temporal abstractions instead of thinking what to do at every next second.”

However, there are also limits to DiffSkill’s capacity. For example, when performing one of the tasks that required three-stage planning, DiffSkill’s performance degrades significantly (though it is still better than other techniques). Lin also mentioned that in some cases, the feasibility predictor produces false positives. The researchers believe that learning a better latent space can help solve this problem.

The researchers are also exploring other directions to improve DiffSkill, including a more efficient planner algorithm that can be used for longer horizon tasks.

Lin hopes that one day, he can use DiffSkill on real pizza-making robots. “We are still far from this. Various challenges emerge from control, sim2real transfer, and safety. But we are now more confident at trying some long-horizon tasks,” he said.

This article was originally published by Ben Dickson onTechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. You can read the original articlehere.

Story byBen Dickson

Ben Dickson is the founder of TechTalks. He writes regularly about business, technology and politics. Follow him on Twitter and Facebook(show all)Ben Dickson is the founder ofTechTalks. He writes regularly about business, technology and politics. Follow him onTwitterandFacebook

Get the TNW newsletter

Get the most important tech news in your inbox each week.

Also tagged with

More TNW

About TNW

French startup Poolside nears $3B valuation for AI that can write code

Europe has opened a door to a universal wallet. The web’s inventor wants to enter

Discover TNW All Access

How AI can help you make a computer game without knowing anything about coding

Holiday homes platform launches ‘global first’ visual search engine