I'm Sabariswaran Mani (Sabarish), a fourth-year undergraduate at the Indian Institute of Technology, Kharagpur. My interests lie in computer vision and robotics, with a focus on autonomous ground vehicles and generative imagery. I'm passionate about bringing these technologies to real-world applications. Beyond research, I have a long-standing love for travel—scroll down to see some of my adventures!
Currently, I am a researcher at the Vision & AI Lab, IISc Bangalore, under the guidance of Prof. Venkatesh Babu, where I explore diffusion models and their applications. I'm also a member of the Autonomous Ground Vehicles Research Group and the Quant Club at IIT KGP. Outside the lab, I enjoy football, computer games, and a good plate of parotta.
Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models that learn a concept from a few images. When used for face personalization, existing approaches struggle to achieve convincing inversion with identity preservation and rely on semantic, text-based editing of the generated face. However, finer-grained control is desired for facial attribute editing, which is challenging to achieve with text prompts alone. In contrast, StyleGAN models learn a rich face prior and enable smooth, fine-grained attribute editing through latent manipulation. This work uses the disentangled W+ space of StyleGANs to condition the T2I model. This allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the coarse text-based control already inherent in T2I models. To condition the T2I model on the W+ space, we train a latent mapper that translates latent codes from W+ to the token embedding space of the T2I model. The proposed approach excels at precise inversion of face images with attribute preservation and enables continuous control for fine-grained attribute editing. Furthermore, it readily extends to generating compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
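To make the conditioning concrete, here is a minimal PyTorch sketch of such a latent mapper, assuming a StyleGAN2 W+ code of shape 18×512 and a 768-dimensional text-token embedding space; the layer widths, the number of predicted tokens, and the class name are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class W2TokenMapper(nn.Module):
    """Maps a StyleGAN W+ latent (n_layers x 512) to token embeddings that can
    be injected into a T2I model's text-conditioning stream.
    Dimensions here are illustrative assumptions, not the paper's exact ones."""
    def __init__(self, n_wplus_layers=18, w_dim=512, token_dim=768, n_tokens=2):
        super().__init__()
        self.n_tokens = n_tokens
        self.token_dim = token_dim
        self.net = nn.Sequential(
            nn.Linear(n_wplus_layers * w_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 2048),
            nn.GELU(),
            nn.Linear(2048, n_tokens * token_dim),
        )

    def forward(self, w_plus):              # w_plus: (B, 18, 512)
        x = w_plus.flatten(start_dim=1)      # (B, 18*512)
        tokens = self.net(x)                 # (B, n_tokens*token_dim)
        return tokens.view(-1, self.n_tokens, self.token_dim)

# Usage: the predicted tokens stand in for a placeholder word's embeddings in
# the prompt, so text-based control is kept while smooth W+ edits (e.g. adding
# a smile) smoothly change the injected tokens.
mapper = W2TokenMapper()
w_plus = torch.randn(1, 18, 512)             # e.g. from a face-inversion encoder
face_tokens = mapper(w_plus)                  # (1, 2, 768)
```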
Robot-learning tasks are extremely compute-intensive and hardware-specific. Tackling these challenges with diverse datasets of offline demonstrations that can be used to train robot manipulation agents is therefore very appealing. The Train-Offline-Test-Online (TOTO) Benchmark provides a well-curated open-source dataset for offline training, composed mostly of expert data, along with benchmark scores for common offline-RL and behaviour-cloning agents. In this paper, we introduce DiffClone, an offline algorithm that enhances behaviour cloning with diffusion-based policy learning, and measure its efficacy on real, online physical robots at test time. This is also our official submission to the Train-Offline-Test-Online (TOTO) Benchmark Challenge organized at NeurIPS 2023. We experimented with both pre-trained visual representations and agent policies. In our experiments, we find that a MoCo-finetuned ResNet50 performs best compared to other finetuned representations. Goal-state conditioning and mapping to transitions resulted in a minute increase in success rate and mean reward. For the agent policy, we developed DiffClone, a behaviour-cloning agent improved using conditional diffusion.
Topped the leaderboard with a mean reward of 12 and a success rate of 62% on the online Franka Panda robot, and 52 and 91% respectively in the MuJoCo simulator for the pouring task, outperforming the top competitor (BC + MoCo).
Proposed DiffClone for offline behavioural cloning, using conditional DDPMs for visual policy learning (a minimal policy sketch follows below).
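As a rough illustration of the conditional-DDPM policy idea (not the exact challenge submission), the sketch below noises an expert action and trains a small network to predict that noise, conditioned on the diffusion timestep and a visual feature vector such as MoCo ResNet50 features; the action and feature dimensions, network sizes, and noise schedule are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action, conditioned on the diffusion
    timestep and a visual observation embedding (e.g. MoCo ResNet50 features)."""
    def __init__(self, action_dim=7, obs_dim=2048, hidden=512, n_steps=100):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, t, obs_feat):
        x = torch.cat([noisy_action, obs_feat, self.t_embed(t)], dim=-1)
        return self.net(x)

# Standard DDPM forward-process schedule (linear betas for brevity).
n_steps = 100
betas = torch.linspace(1e-4, 0.02, n_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def bc_diffusion_loss(model, actions, obs_feat):
    """One behaviour-cloning training step: noise the expert action and
    regress the added noise, conditioned on the observation features."""
    B = actions.shape[0]
    t = torch.randint(0, n_steps, (B,))
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy, t, obs_feat), noise)

model = NoisePredictor()
loss = bc_diffusion_loss(model, torch.randn(32, 7), torch.randn(32, 2048))
loss.backward()
```

At test time, actions are recovered by running the reverse diffusion process from Gaussian noise, conditioned on the current observation features.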
Developed a method to predict tweet popularity from metadata and text-analysis features in a non-contextual approach (see the feature-based sketch after these points).
Fine-tuned mPLUG-Owl2 and Llama-2 with policy-based reinforcement learning for bandit-informed, personalized tweet generation, routing between the LLMs to maximize engagement metrics such as likes.
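As a rough illustration of the non-contextual popularity predictor, the sketch below fits a gradient-boosted regressor on hand-crafted metadata and shallow text features; the feature set, the choice of model, and the stand-in data are illustrative assumptions rather than the exact pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical, non-contextual feature columns: author metadata plus shallow
# text statistics, with no transformer embeddings.
# [followers, following, account_age_days, hashtag_count,
#  mention_count, tweet_length, sentiment_score, posting_hour]
X = np.random.rand(5000, 8)
y = np.random.poisson(lam=20, size=5000).astype(float)   # like counts (stand-in data)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Regress log-likes so the heavy-tailed engagement distribution is easier to fit.
model = GradientBoostingRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, np.log1p(y_train))

pred_likes = np.expm1(model.predict(X_test))
print("Predicted likes for first test tweet:", pred_likes[0])
```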
Developed a prototype industrial-grade dimensioning system using live 3D point-cloud data from a depth camera (the core measurement step is sketched below).
Reduced measurement error to just 0.2% across all three dimensions for arbitrarily shaped objects moving on a conveyor-belt system.
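The core measurement step can be sketched as fitting an oriented bounding box to the object's points via PCA; the assumption that the object has already been segmented from the belt and background, as well as the synthetic test data, are illustrative and not the exact production pipeline.

```python
import numpy as np

def measure_dimensions(points: np.ndarray) -> np.ndarray:
    """Estimate length/width/height of an object from its 3D point cloud by
    fitting an oriented bounding box via PCA. `points` is an (N, 3) array of
    object points, assumed already segmented from the belt/background."""
    centered = points - points.mean(axis=0)
    # Principal axes of the point cloud give the box orientation.
    _, _, axes = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ axes.T
    extents = projected.max(axis=0) - projected.min(axis=0)
    return np.sort(extents)[::-1]   # sorted: length >= width >= height

# Usage with synthetic points filling a 0.30 x 0.20 x 0.10 m box, rotated on the belt:
rng = np.random.default_rng(0)
box = rng.uniform(-0.5, 0.5, size=(5000, 3)) * np.array([0.30, 0.20, 0.10])
theta = np.deg2rad(25)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
print(measure_dimensions(box @ R.T))   # approximately [0.30, 0.20, 0.10]
```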