VISION-LANGUAGE MODEL FOR ROBOT GRASPING

Keshari, Abhinav Kaushal

doi:10.25394/PGS.22687645.v1

thesis

posted on 2023-05-01, 00:03 authored by Abhinav Kaushal Keshari

Robot grasping is an active area of robotics research, driven by growing worldwide interest in human-robot interaction across diverse industrial settings where robots and humans share tasks and workplaces. Existing work focuses mainly on the quality of the grasps generated for object manipulation. Despite these advances, such methods have yet to consider human-robot collaboration settings in which robots and humans must grasp the same object concurrently. Generating robot grasps that are compatible with how humans prefer to hold an object at the same time is therefore necessary to ensure a safe and natural collaboration experience. In this work, we propose a novel deep neural network-based method called CoGrasp that generates human-aware robot grasps by contextualizing human preference models of object grasping into the robot grasp selection process. We validate our approach against existing state-of-the-art robot grasping methods through simulated and real-robot experiments and user studies. In real-robot experiments, our method achieves a success rate of about 88% in producing stable grasps that also allow humans to interact with and grasp the objects simultaneously in a socially compliant manner. Furthermore, our user study with 10 independent participants indicated that our approach enables a safe, natural, and socially aware human-robot object co-grasping experience compared to a standard robot grasping technique.

To facilitate the grasping process, we also introduce a vision-language model that serves as a pre-processing stage before the grasping action takes place. In most settings, robots are equipped with sensors that capture the scene; the vision model runs a detection task on the captured image to identify the objects visible in the environment. The language model is used to program the robot so that it can understand and execute the required sequence of tasks. Using object detection, we build a set of object queries from the sensor image and let the user provide an input query describing the task to be performed. We then compute similarity scores between the input query and the object queries to localize the object that needs attention; once it is identified, a grasping process can be applied for the task at hand.
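
The query-matching step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not name the detector, the embedding model, or the similarity measure, so the sketch assumes cosine similarity over embedding vectors and uses small hand-written vectors in place of real model outputs. The names `DetectedObject`, `cosine_similarity`, and `localize_object` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class DetectedObject:
    """One object query built from the sensor image (hypothetical structure)."""
    label: str
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    embedding: Sequence[float]       # stand-in for a vision-language embedding


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def localize_object(
    query_embedding: Sequence[float],
    object_queries: List[DetectedObject],
) -> Tuple[float, DetectedObject]:
    """Score every object query against the user query and return the best match."""
    if not object_queries:
        raise ValueError("no objects were detected in the scene")
    scored = [(cosine_similarity(query_embedding, obj.embedding), obj)
              for obj in object_queries]
    return max(scored, key=lambda pair: pair[0])


# Toy data: assumed embeddings, not outputs of any real model.
scene = [
    DetectedObject("mug", (40, 60, 120, 150), [0.9, 0.1, 0.0]),
    DetectedObject("banana", (200, 80, 300, 130), [0.1, 0.9, 0.1]),
    DetectedObject("scissors", (330, 40, 400, 160), [0.0, 0.2, 0.9]),
]
user_query_embedding = [0.8, 0.2, 0.1]  # assumed embedding of "hand me the cup"

score, target = localize_object(user_query_embedding, scene)
print(f"target={target.label} box={target.box} score={score:.3f}")
```

In this sketch the bounding box of the best-scoring object is what a downstream grasping stage would receive; in a real system the embeddings would come from the vision and language models rather than from fixed vectors.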

Degree Type

Master of Science in Electrical and Computer Engineering

Department

Electrical and Computer Engineering

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Ahmed H. Qureshi

Advisor/Supervisor/Committee co-chair

David Iseri Inouye

Additional Committee Member 2

Irith Pomeranz

Keywords

Collaborative robots grasping network Deep Learning Automates vision language navigation

Licence

CC BY 4.0
