Since the beginning of summer 2024, my work has centered on analyzing and evaluating generalist agents in robotic manipulation, with a focus on vision-language models like GPT-4V and robotics foundation models like Octo. Despite recent breakthroughs that have pushed these models to state-of-the-art results across multiple tasks, a significant gap persists in thorough, low-level analyses of these systems within the robotics field. This gap hinders the identification of areas for improvement and complicates fair comparisons between robotic generalist agents across different scenarios. My current work seeks to address this gap by offering a more in-depth evaluation of these models. I aim to publish my current research projects at top-tier conferences such as Robotics: Science and Systems (RSS) and the Conference on Robot Learning (CoRL) in 2025. Press the toggles below to learn more about the underlying projects.
Working on HERO, a framework designed to address the gaps in current evaluation practices for generalist policies like Octo and OpenVLA. The framework aims to offer a more detailed and thorough evaluation of generalist robot policies by utilizing real-world datasets like DROID. HERO goes beyond basic success rates by incorporating action-based metrics that evaluate performance at each step of task execution, providing a deeper understanding of policy behavior. The framework will enable detailed task-specific analysis and sensitivity evaluations, offering a more precise understanding of agent performance across diverse real-world scenarios.
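As a minimal illustration of what such a per-step, action-based metric could look like, here is a hedged sketch in Python; the function name, trajectory format, and use of an L2 error are illustrative assumptions, not the framework's actual metrics or data loaders.

```python
import numpy as np

def per_step_action_error(predicted_actions, reference_actions):
    """Per-step L2 error between a policy's actions and reference dataset actions.

    Both inputs have shape (T, action_dim), e.g. end-effector deltas for one
    DROID-style trajectory. Returning one error per step lets behavior be
    analyzed along the trajectory instead of collapsing it into a single
    success/failure bit.
    """
    predicted_actions = np.asarray(predicted_actions, dtype=float)
    reference_actions = np.asarray(reference_actions, dtype=float)
    return np.linalg.norm(predicted_actions - reference_actions, axis=-1)

# Toy usage: a 3-step trajectory with a 7-dimensional action space.
pred = np.random.randn(3, 7)
ref = np.random.randn(3, 7)
print(per_step_action_error(pred, ref))  # one error value per step
```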
Worked on a novel benchmark designed to assess the low-level manipulation reasoning skills of VLMs across various dimensions, including their ability to predict temporal relations, understand object-object interactions, and handle deformable objects. The benchmark provides a structured evaluation pipeline in which VLMs predict one of a set of keypoints, representing contact points and movement directions, to simulate precise robotic actions. It consists of multiple-choice questions (MCQs) derived from environments sourced through simulations, existing real-world datasets, or robot datasets that we have manually curated. The experiments evaluating different VLM families, such as GPT-4 and InternVL, on this benchmark are implemented primarily in Python and PyTorch. Furthermore, I have been working on the real-world setup for this project, where I have acquired expertise in ROS, Docker, robot-camera calibration, and URX for operating a bimanual UR5 setup. We aim to demonstrate a correlation between the performance of VLMs on our benchmark and their performance when used to guide real-world robot systems, and we plan to submit this work to RSS 2025.
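A minimal sketch of how such keypoint MCQs can be scored is shown below; the dictionary fields and scoring function are hypothetical and only illustrate the idea of picking one candidate keypoint per question.

```python
def score_keypoint_mcq(questions, predictions):
    """Accuracy over keypoint multiple-choice questions.

    Each question lists candidate keypoints (e.g. contact points or movement
    directions drawn on the image) along with the index of the correct option;
    the VLM is asked to pick exactly one option per question.
    """
    correct = sum(
        1 for q, pred in zip(questions, predictions) if pred == q["answer_idx"]
    )
    return correct / len(questions)

# Toy usage with two questions, each offering four candidate keypoints (pixels).
questions = [
    {"options": [(10, 20), (40, 55), (70, 15), (25, 80)], "answer_idx": 2},
    {"options": [(5, 5), (60, 40), (30, 90), (80, 80)], "answer_idx": 0},
]
predictions = [2, 3]  # e.g. parsed from the VLM's text responses
print(score_keypoint_mcq(questions, predictions))  # 0.5
```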
Since the spring of 2024, I have focused on exploring the use of recent vision-language models (VLMs) for deformable object manipulation tasks. While current approaches depend heavily on data-intensive learning methods such as imitation and reinforcement learning, recent research has demonstrated the potential of VLMs to serve as high-level task planners in robotic manipulation. My goal was to investigate whether VLMs could instead act as low-level planners for zero-shot deformable object manipulation. This research led to significant results, culminating in the publication of my work at the International Symposium on Robotics Research (ISRR) 2024. I am currently working on extending this work, focusing on the limitations of GPT-Fabric and investigating ways in which its scope can be broadened. Press the toggles below to learn more about the underlying projects.
Though our previous work on GPT-Fabric demonstrated impressive results on the canonical tasks of fabric smoothing and folding, it had some limitations. Despite beating most prior works in fabric folding performance, GPT-Fabric was not able to match the (then) state-of-the-art. Furthermore, GPT-Fabric did not leverage the bimanual capabilities of the robot embodiment, which could be beneficial for certain folding tasks, nor did we experiment with the fine-tuning or code-generation capabilities of the underlying VLMs to improve fabric folding performance. To address these limitations, I have been advising a visiting research intern under Prof. Daniel Seita on exploring ways to move beyond the capabilities of the vision-language foundation models previously leveraged by GPT-Fabric. We are also working towards demonstrating the generalizability of GPT-Fabric++ by benchmarking performance on fabric manipulation tasks with fabrics of different shapes and configurations, expanding the scope beyond rectangular fabrics.
Worked on GPT-Fabric, a framework that allows OpenAI's GPT models to perform fabric smoothing and folding tasks by directly generating low-level manipulation actions. We demonstrated state-of-the-art performance in fabric smoothing through extensive simulation experiments. Furthermore, our method achieved results comparable to the baselines for fabric folding, without needing any fabric-specific training data. Initial ablation experiments revealed that naïve prompting with fabric images led to poor performance, highlighting the necessity of GPT-Fabric's specialized approach for reliable and effective manipulation. This project sharpened my expertise in prompt engineering, giving me a deeper understanding of GPT-4's capabilities in robotic manipulation. Additionally, I gained expertise in SoftGym, the simulation environment for our experiments. Our submission to ISRR 2024 was accepted, and I will be attending the conference in Long Beach, CA, USA.
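For context, here is a minimal sketch of how a top-down fabric image can be sent to a vision-capable GPT model to request a pick-and-place action; the prompt, model name, and output format are illustrative assumptions and not GPT-Fabric's actual prompting pipeline (which, as noted above, relies on a more specialized approach). It assumes the `openai` Python package and an `OPENAI_API_KEY` in the environment.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_fold_action(image_path):
    """Ask a vision-capable GPT model for a pick pixel and a place pixel."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This is a top-down image of a fabric. Reply with a pick "
                         "pixel and a place pixel as 'pick: (x, y), place: (x, y)' "
                         "to fold the fabric in half."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # parsed downstream into an action

print(query_fold_action("fabric_top_down.png"))
```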
Traditional Neural Machine Translation (NMT) models, especially those built on Transformer architectures, work well with high-resource language pairs but struggle with low-resource languages due to the lack of sufficient parallel data. Although various methods, such as pre-training and integrating linguistic information, have been suggested to enhance NMT for low-resource languages, a notable performance gap still exists, particularly for Indic languages. Given how widely these languages are spoken, including by me, I worked on improving their NMT performance during my Bachelor's thesis at IIT Delhi in 2020-2021 and my first semester at USC in Fall 2023. Despite carefully designing multiple approaches and pursuing different avenues of experimentation, the results were underwhelming. During this period, the rapid progress in generalist agents, especially vision-language and robotic foundation models, piqued my interest. The diverse capabilities of these systems, combined with their potential for transformative applications, inspired me to shift my research focus from NMT to foundation models in robotics. Press the toggles below to learn more about the underlying projects.
Integrated BERT-based models as source experts with Neural Machine Translation (NMT) models, specifically mBART, to enhance low-resource translation from languages like Nepali to English. Using embedding fusion, we combined the source expert's last hidden layer with mBART's embeddings and experimented with various fine-tuning strategies to preserve pre-trained knowledge. However, the combined approach showed limited BLEU score improvements compared to fine-tuned mBART. Information-theoretic analysis revealed higher information gain when connecting the source expert NepBERTa to a pretrained mBART than to a fine-tuned one, suggesting that noise in low-resource embeddings limited performance. Further analysis using a trainable adapter model provided even less information gain, emphasizing that issues like noisy data in low-resource datasets significantly impede translation accuracy. The code was primarily written in PyTorch, and we utilized the computing resources available at USC for our experiments.
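A minimal PyTorch sketch of the embedding-fusion idea is given below; concatenation followed by a learned projection is one common fusion choice and is used here purely for illustration, and the hidden sizes and class name are assumptions rather than the project's actual architecture.

```python
import torch
import torch.nn as nn

class FusionProjection(nn.Module):
    """Fuse a source expert's hidden states with the NMT model's token embeddings.

    Both inputs have shape (batch, seq_len, hidden); the concatenation is
    projected back to the NMT model's hidden size so the fused embeddings can
    replace the ordinary embedding lookup at the encoder input.
    """
    def __init__(self, expert_dim=768, nmt_dim=1024):
        super().__init__()
        self.proj = nn.Linear(expert_dim + nmt_dim, nmt_dim)

    def forward(self, expert_hidden, nmt_embeddings):
        fused = torch.cat([expert_hidden, nmt_embeddings], dim=-1)
        return self.proj(fused)

# Toy usage: a batch of 2 sentences, 16 tokens each.
expert_hidden = torch.randn(2, 16, 768)    # e.g. NepBERTa's last hidden layer
nmt_embeddings = torch.randn(2, 16, 1024)  # e.g. mBART's input embeddings
fused = FusionProjection()(expert_hidden, nmt_embeddings)
print(fused.shape)  # torch.Size([2, 16, 1024])
```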
Developed a framework to improve machine translation (MT) for medium- and low-resource languages by tackling challenges in embedding transfer and overcoming data scarcity. Building on prior work that incorporated syntactic data with token embeddings, the project applied cross-linguistic embedding transfer by using synsets from IndoWordNet as well as by incorporating transliteration. This approach led to enhanced translation performance in low-data environments, particularly for Indian languages like Gujarati and Punjabi. A supervised MT corpus for 15 Indian languages was also curated to support broader MT research. However, this effort was ultimately superseded by similar-scale initiatives occurring concurrently within the research community. The framework, leveraging Transformer-based deep neural models, facilitates a nuanced analysis of syntactic transfer and embedding efficacy. It provides valuable insights into the performance of machine translation systems in linguistically diverse and low-resource contexts. The computational experiments were conducted on high-performance computing resources at IIT Delhi.
Even before discovering the fields of Machine Learning and Robotics, I had a natural inclination toward research. This passion was fueled by my love for teaching, the fulfillment I found in sharing knowledge, and the immense joy I experienced in helping others learn new concepts. Additionally, I have always been captivated by the idea of pushing boundaries and creating innovations that have the potential to bring significant and meaningful value to the world. My formal introduction to research in computer science occurred during my sophomore year at IIT Delhi, where I was exposed to groundbreaking advancements across various fields. Inspired by the work of Prof. Rahul Narain in Computer Graphics and Prof. Arnab Bhattacharyya in Algorithms, I began my research journey under their mentorship. Eager to gain industry research experience, I pursued a research internship at Adobe during the summer after my junior year. This opportunity provided me with my first formal exposure to state-of-the-art advancements in Machine Learning and Computer Vision, eventually shaping my interests as an early researcher.
During the summer of 2020, I worked as an undergraduate research intern at Adobe Research under the guidance of Dr. Sumit Shekhar. Our work focused on applying deep learning techniques to document beautification, where we aimed to enhance the visual appeal and readability of documents. Studies have shown that a document's visual appeal is judged within the first 50 milliseconds, making good design essential for a positive impression. However, documents often restrict styling and modifications, making it challenging for novice users to select appropriate templates, transfer styles, or add personal specifications. To address these challenges, we developed a learning-based system for Personalized Document Creation via Template-Based Style Transfer. This system automates the layout and style transfer process, while also incorporating personal specifications and providing attribute-specific recommendations to enhance user experience. Our work was published as a U.S. patent in December 2022. Press the toggles below to get more information about the underlying projects.
Developed a comprehensive system for document beautification with several key components: a preprocessed template dataset with extracted styles, a novel graph-based algorithm and scoring metric for recommending templates based on their layouts, and a document rendering pipeline featuring a custom commonJson format for rendering. Trained a Generative Adversarial Network on the PubLayNet dataset and used it to generate new layouts for transferring content between the input document and a selected template. Incorporated homomorphic interpolation to facilitate the layout generation and optimized it by designing a heuristic for effectively choosing the interpolation direction, significantly improving sample efficiency. Additionally, we designed optimized models for font and color recommendations, generating palettes conditioned on the input document.
In the winter of 2019, I visited the National University of Singapore, where I collaborated with Prof. Arnab Bhattacharyya and (now Prof.) Sutanu Gayen on Causal Inference and Streaming Algorithms. While our work on Streaming Algorithms did not yield substantial improvements over the baseline, our research on Causal Inference led to a publication at the International Conference on Artificial Intelligence and Statistics (AISTATS) in 2022. Press the toggles below to get more information about the underlying projects.
I contributed to developing an efficient algorithm for learning interventional distributions in causal Bayesian networks within the PAC (Probably Approximately Correct) framework. Building on the ID algorithm by Shpitser and Pearl, we developed a polynomial-time algorithm that, under certain conditions, learns a distribution that is ε-close to the true interventional distribution. I worked on generalizing the conditions from previous research by Bhattacharyya et al., exploring methods to preserve both optimality and sample efficiency in the learning process. Another major contribution of this work was proving that, for arbitrary subsets of outcome variables, learning a close approximation of the interventional distribution is computationally hard under standard complexity-theoretic assumptions. This complexity analysis was beyond my expertise, requiring a deeper mathematical foundation than what I had acquired during my undergraduate studies. Our paper was accepted at the prestigious AISTATS conference in 2022.
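For concreteness, the ε-closeness guarantee can be read in the standard total-variation sense shown below (a hedged rendering; the precise metric and probability statement are as specified in the paper):

```latex
d_{\mathrm{TV}}\bigl(\widehat{P},\, P_{x}(Y)\bigr)
  \;=\; \tfrac{1}{2}\sum_{y} \bigl|\,\widehat{P}(y) - P_{x}(Y = y)\,\bigr| \;\le\; \varepsilon
```

Here P_x(Y) denotes the true interventional distribution of the outcome variables Y under the intervention do(X = x), and the learned distribution must satisfy the bound with high probability over the samples, which is the PAC part of the guarantee.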
Implemented algorithms like Heavy-Keeper and Double Space-Saving in C++ and evaluated their performance using datasets for network flows and movie recommendations. Initially, the project aimed to develop the Double Space-Saving method for efficiently detecting heavy hitters in a stream, with the goal of comparing its performance to state-of-the-art methods on existing datasets. However, we ultimately discontinued the project as the method did not outperform the baseline algorithms.
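For background, a minimal Python sketch of the classic Space-Saving idea that these methods build on is shown here (the project's actual implementations were in C++, and the Heavy-Keeper and double variants differ in detail):

```python
def space_saving(stream, k):
    """Classic Space-Saving: maintain at most k counters and report heavy hitters.

    When a new item arrives and all k counters are occupied, it replaces the
    item with the smallest count and inherits that count plus one, so frequent
    items end up over- rather than under-estimated.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            min_item = min(counters, key=counters.get)
            counters[item] = counters.pop(min_item) + 1
    return counters

# Toy usage on a small stream with k = 2 counters.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(space_saving(stream, k=2))  # 'a' dominates the surviving counters
```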
I began my research journey in 2019 by exploring theoretical approaches to understand and optimize existing methods for simulating deformable objects. However, these projects faced challenges in producing substantial results and were limited by high computational demands. Later, I briefly investigated ways to overcome the limitations of ADMM-based elastic solvers for simulating systems with transiently appearing constraints. I worked actively on these different problems in computer graphics until my senior undergraduate year at IIT Delhi, when I shifted my efforts to Neural Machine Translation for my Bachelor's thesis. Press the toggles below to get more information about the underlying projects.
Investigated the limitations of ADMM-based elastic solvers in simulating systems with transiently appearing constraints. I was involved in designing separate constraint resolution modules that wrap around the solver to predict the constraint force distribution. To evaluate the effectiveness of these modules, I experimented with various simulated physical systems, comparing their performance against baselines to assess the improvements achieved.
While traditional SPH methods provide efficient simulations for compressible fluids, they struggle with time-step restrictions, stiffness parameter dependencies, and scaling challenges in large-scale scenarios, prompting the development of IISPH to improve stability, scalability, and convergence in incompressible fluid simulations. Though IISPH produces impressive fluid simulations, it uses backward Euler as its integration method, and we wondered whether higher-order numerical methods could be integrated with IISPH instead without significantly impacting the computational complexity of the system. I studied incorporating numerical methods such as the Backward Differentiation Formula and the trapezoid method into the solve for the pressure forces that resolve compression, formulated as a system of linear equations. This research project introduced me to the complex mathematical paradigms behind fluid mechanics and strengthened my understanding of linear algebra and calculus beyond the classroom setting.
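As a small illustration of how the update rules differ, here is a hedged Python sketch that applies backward Euler and the trapezoid rule to a linear test system x' = A x; this is a toy stand-in, not the actual IISPH pressure solve.

```python
import numpy as np

def backward_euler_step(A, x, h):
    """One backward Euler step for x' = A x: solve (I - h A) x_new = x."""
    n = len(x)
    return np.linalg.solve(np.eye(n) - h * A, x)

def trapezoid_step(A, x, h):
    """One trapezoid-rule step for x' = A x:
    solve (I - h/2 A) x_new = (I + h/2 A) x, which is second-order accurate."""
    n = len(x)
    return np.linalg.solve(np.eye(n) - 0.5 * h * A, (np.eye(n) + 0.5 * h * A) @ x)

# Toy usage: a damped oscillator integrated for one step with each rule.
A = np.array([[0.0, 1.0], [-4.0, -0.5]])
x0 = np.array([1.0, 0.0])
print(backward_euler_step(A, x0, h=0.1))
print(trapezoid_step(A, x0, h=0.1))
```

Both rules lead to a linear solve per step, which is why swapping them into the pressure solve need not blow up the computational cost.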
While simulating hair and fur is common in the movie industry, those features are rarely seen in today's computer games. The main difficulty of simulating hair in real-time applications is the sheer number of hair strands and the fact that each hair is inextensible, making the process computationally intractable. At the time of this project, incorporating Position Based Dynamics (PBD) as a solver for simulating inextensible deformable objects was an active area of research. One such PBD-based approach was Dynamic Follow-The-Leader (DFTL), which could simulate thousands of hair strands in real time; however, there were no proven guarantees on how physically realistic its results would be. I worked on evaluating the method theoretically and was able to prove and demonstrate that it converges to the physically accurate solution in the limit of the time step tending to zero. Furthermore, I observed that DFTL, though computationally efficient, introduced a lot of artificial damping when simulating in real time. I attempted to address this limitation by experimenting with different ways to remodel the framework. However, my attempts did not yield fruitful results, and we eventually decided not to move forward with this project. Being the first research problem I undertook, this project introduced me to academic research and equipped me with skills in experimental design and the coherent expression of ideas, along with strengthening my grasp of MATLAB.
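For intuition, here is a minimal Python sketch of the geometric projection at the core of Follow-The-Leader; the dynamic variant (DFTL) adds a velocity correction on top of this, which is omitted here, and the array layout is illustrative.

```python
import numpy as np

def follow_the_leader(positions, segment_length):
    """Project a chain of particles so every segment has the prescribed length.

    Starting from the (fixed) root, each particle is pulled onto the sphere of
    radius `segment_length` around its predecessor, enforcing inextensibility
    one segment at a time.
    """
    projected = positions.copy()
    for i in range(1, len(projected)):
        direction = projected[i] - projected[i - 1]
        norm = np.linalg.norm(direction)
        if norm > 1e-12:
            projected[i] = projected[i - 1] + direction / norm * segment_length
    return projected

# Toy usage: a 4-particle strand whose segments are stretched unevenly.
strand = np.array([[0.0, 0.0, 0.0],
                   [0.0, -1.3, 0.0],
                   [0.1, -2.9, 0.0],
                   [0.0, -3.5, 0.0]])
print(follow_the_leader(strand, segment_length=1.0))
```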
I have taken a number of courses related to Artificial Intelligence and Robotics during my undergraduate and graduate studies at IIT Delhi and USC, respectively. These classes gave me the opportunity to pursue numerous projects that strengthened my technical grasp of these fields. Press the toggles below to get more information about the underlying projects.
Despite the extensive research on dynamic manipulation and dexterous manipulation in recent years, it has been challenging to leverage dexterous embodiments to perform dynamic manipulation. One such task is dynamic handover between robotic hands; however, prior works require the robots to be in close proximity to each other. This project addresses that limitation by introducing a Reinforcement Learning (RL) based framework for dexterous hands to perform the dynamic manipulation tasks of catching and throwing objects of different shapes, regardless of their physical proximity to one another. We created simulation mini-benchmarks from scratch in MuJoCo and used Soft Actor-Critic (SAC) as the RL algorithm for training our system, combining proprioceptive features with image features to train the Visual SAC agent.
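A minimal PyTorch sketch of this kind of observation fusion is shown below; the layer sizes, image resolution, and proprioceptive dimension are illustrative assumptions rather than the project's exact architecture.

```python
import torch
import torch.nn as nn

class ProprioImageEncoder(nn.Module):
    """Encode a camera frame and proprioceptive state into one feature vector
    that the SAC actor and critics can consume."""
    def __init__(self, proprio_dim=24, feature_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            cnn_out = self.cnn(torch.zeros(1, 3, 84, 84)).shape[1]
        self.fuse = nn.Sequential(nn.Linear(cnn_out + proprio_dim, feature_dim), nn.ReLU())

    def forward(self, image, proprio):
        return self.fuse(torch.cat([self.cnn(image), proprio], dim=-1))

# Toy usage: a batch of 4 camera frames plus 24-D joint/fingertip states.
encoder = ProprioImageEncoder()
features = encoder(torch.randn(4, 3, 84, 84), torch.randn(4, 24))
print(features.shape)  # torch.Size([4, 128])
```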
Developed a comprehensive robotic grasping and motion planning pipeline for a 6-DOF robotic manipulator using the AIKIDO infrastructure and ROS Noetic, integrating Rapidly-exploring Random Trees (RRT), inverse kinematics (IK), and Jacobian-based control. Employed a Task Space Region (TSR) constraint for the task of grasping a soda can, ensuring the robot sampled only accessible regions around the can. In addition, I improved the planner's efficiency by adding a probabilistic goal-sampling function, which reduced computation while maintaining trajectory accuracy. Utilized RViz for visualizing the trajectory paths before executing the planner on real robot hardware.
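Here is a minimal sketch of the probabilistic goal-sampling idea; the sampler interfaces and the 10% bias value are illustrative, not the pipeline's actual code.

```python
import random

def sample_configuration(goal_sampler, random_sampler, goal_bias=0.1):
    """Goal-biased sampling for RRT: with probability `goal_bias`, draw from the
    goal region (e.g. TSR-sampled grasp poses around the can); otherwise sample
    the configuration space uniformly. A small bias keeps exploration broad
    while still pulling the tree toward the goal, reducing planning time."""
    if random.random() < goal_bias:
        return goal_sampler()
    return random_sampler()

# Toy usage with stand-in samplers over a 2-D configuration space.
goal_sampler = lambda: (0.9, 0.9)
random_sampler = lambda: (random.random(), random.random())
samples = [sample_configuration(goal_sampler, random_sampler) for _ in range(10)]
print(samples)
```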
Implemented an AI agent to play Little-Go, a simplified version of the game of Go on a 5x5 board, from scratch. Incorporated techniques like Minimax with alpha-beta pruning and reinforcement learning to help the agent handle complex moves. Utilized key Go concepts like "Komi" to design a heuristic evaluation function that estimates the agent's success without exploring all move possibilities to the end of the game. Scored highly in the class tournament, defeating bots implemented by other classmates.
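A minimal, game-agnostic Python sketch of Minimax with alpha-beta pruning is given below; the callback-style interface is an illustrative simplification, not the Little-Go agent's actual board representation.

```python
def alphabeta(state, depth, alpha, beta, maximizing, get_moves, apply_move, evaluate):
    """Minimax with alpha-beta pruning. `evaluate` is the heuristic (e.g. stone
    difference adjusted by komi) used at the depth cutoff instead of playing
    every line to the end of the game."""
    moves = get_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for move in moves:
            value = max(value, alphabeta(apply_move(state, move), depth - 1,
                                         alpha, beta, False,
                                         get_moves, apply_move, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:  # the opponent will never allow this branch
                break
        return value
    value = float("inf")
    for move in moves:
        value = min(value, alphabeta(apply_move(state, move), depth - 1,
                                     alpha, beta, True,
                                     get_moves, apply_move, evaluate))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value

# Toy usage: a 2-ply game tree encoded as nested lists, leaves are scores.
tree = [[3, 5], [2, 9], [0, 1]]
get_moves = lambda s: list(range(len(s))) if isinstance(s, list) else []
apply_move = lambda s, m: s[m]
evaluate = lambda s: s
print(alphabeta(tree, depth=2, alpha=float("-inf"), beta=float("inf"),
                maximizing=True, get_moves=get_moves,
                apply_move=apply_move, evaluate=evaluate))  # 3
```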
This project compiles essential machine learning and NLP algorithms, each implemented from scratch and optimized extensively as part of the graduate-level Machine Learning course at IIT Delhi. In one assignment, I implemented Linear Regression with batch and stochastic gradient descent, Logistic Regression using Newton's method, and Gaussian Discriminant Analysis for binary classification, using datasets like wine acidity-density pairs and salmon classification from different regions. In another assignment, I developed Naive Bayes and Support Vector Machines, focusing on feature extraction and regularization to enhance classification accuracy. Furthermore, I programmed a module covering Decision Trees built with entropy-based node splits and post-pruning to control overfitting. Finally, I created neural network architectures for tasks like alphabet recognition, experimenting with various activation functions, hidden layers, and adaptive learning rates. The code can be found in the project GitHub.
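As one representative example, here is a hedged NumPy sketch of logistic regression fitted with Newton's method; the dataset, iteration count, and small ridge term are illustrative rather than the assignment's exact setup.

```python
import numpy as np

def logistic_regression_newton(X, y, iters=10):
    """Fit logistic regression with Newton's method (IRLS).

    Each update solves H @ delta = gradient, where the gradient is X^T (y - p)
    and H = X^T diag(p (1 - p)) X; convergence is typically much faster than
    plain gradient descent on small classroom datasets.
    """
    X = np.hstack([np.ones((len(X), 1)), X])  # prepend an intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        theta += np.linalg.solve(hessian + 1e-8 * np.eye(len(theta)), gradient)
    return theta

# Toy usage: a tiny 1-D, linearly separable problem.
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
print(logistic_regression_newton(X, y))
```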
Developed a fully automated system for processing various printed forms, extracting fields and identifying written content without requiring per-image parameter tuning. Handling inputs like scanned forms, printed-out forms, and booklet photos, the system first aligns the documents and then segments fields and characters. For alignment, it analyzes the form's FFT-based spectral data and uses Canny and Hough transforms to detect edges and correct inclination, which is particularly effective for colorful booklet images. Form field segmentation starts by normalizing illumination, converting the image to grayscale, and applying adaptive thresholding. Region properties help discard non-field regions, retaining essential fields for character segmentation. Character detection uses Otsu's thresholding and labels 8-connected components as distinct characters, with morphology-based operations to separate overlapping characters. The system also filters common form symbols, like slashes, by identifying their specific positions relative to field edges. While robust, minor parameter adjustments were necessary for forms with high eccentricity or closely packed characters. This system was implemented in MATLAB. The code can be found in the project GitHub.
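The original pipeline was written in MATLAB; below is a rough Python/OpenCV sketch of the inclination-correction step only, with illustrative threshold values and a median-angle heuristic that stands in for the actual implementation.

```python
import cv2
import numpy as np

def estimate_skew_angle(gray):
    """Estimate a form's inclination (in degrees) from its dominant line directions.

    Edges are detected with Canny, the standard Hough transform votes for line
    orientations, and the median deviation of near-horizontal lines from
    horizontal is taken as the skew angle.
    """
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=200)
    if lines is None:
        return 0.0
    deviations = []
    for rho, theta in lines[:, 0]:
        deviation = np.degrees(theta) - 90.0  # theta of 90 degrees = horizontal line
        if abs(deviation) < 45:               # keep near-horizontal form/text lines
            deviations.append(deviation)
    return float(np.median(deviations)) if deviations else 0.0

def deskew(image):
    """Rotate the form so its text lines become horizontal."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    angle = estimate_skew_angle(gray)
    h, w = image.shape[:2]
    # Sign convention for the correction angle may need flipping for a given setup.
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, rotation, (w, h), borderValue=(255, 255, 255))

# Toy usage (assumes a form image on disk, e.g. "form.png"):
# print(estimate_skew_angle(cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)))
```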
Implemented the Expectation-Maximization (EM) algorithm to learn parameters for a Bayesian network that models relationships between diseases and symptoms, even when some data is missing. Starting with a Bayesian network structure and partially complete health records, the project used EM to estimate the missing parameters, enabling the network to perform diagnostic tasks. The network parameters were initialized heuristically so that the search converged to accurate estimates despite the missing data. The code can be found in the project GitHub.
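A minimal sketch of the EM idea on a single disease-to-symptom edge with missing symptom entries is shown below; the record format and initialization are illustrative, and the assignment's actual network had many more variables and conditional probability tables.

```python
def em_symptom_given_disease(records, iters=20):
    """EM estimate of P(symptom = 1 | disease) with missing symptom entries.

    records: list of (disease, symptom) pairs, disease in {0, 1} and symptom in
    {0, 1, None}. In the E-step, missing symptoms contribute their expected
    value under the current parameters; the M-step re-estimates the conditional
    probabilities from these expected counts.
    """
    theta = {0: 0.5, 1: 0.5}  # heuristic initialization of P(S = 1 | D = d)
    for _ in range(iters):
        expected_ones = {0: 0.0, 1: 0.0}
        totals = {0: 0.0, 1: 0.0}
        for disease, symptom in records:
            totals[disease] += 1
            expected_ones[disease] += theta[disease] if symptom is None else symptom
        theta = {d: expected_ones[d] / totals[d] if totals[d] else theta[d]
                 for d in theta}
    return theta

# Toy usage: symptom readings are missing in two of the six health records.
records = [(1, 1), (1, None), (1, 1), (0, 0), (0, None), (0, 0)]
print(em_symptom_given_disease(records))  # close to {0: 0.0, 1: 1.0}
```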