The lack of precise control in generative AI is a major difficulty for users when it comes to images and videos: generation works like a “slot machine,” where many outputs are produced in the hope that one of them matches the desired outcome, so success depends largely on luck. Current models also suffer from slow generation rates and long computation times, especially for video, with outputs taking anywhere from a few seconds to several minutes to produce, which makes them unsuitable for real-time scenarios and prevents their widespread use.
This week on Simulated Reality, we were joined by Rohit Ramesh, Founder & CEO of Segmind since January 2019. Rohit brings a wealth of diverse experience in technology and entrepreneurship, having previously co-founded SenseHawk and HoverX. Before these ventures, he started Gadgetronica, demonstrating a strong passion for innovation and business development. Rohit earned a Bachelor of Engineering (B.E.) in Mechanical Engineering from Birla Institute of Technology and Science, Pilani, where he studied from 2008 to 2012.
The interview covers a wide range of topics related to Generative AI, including the differences between image/video models and language models, the challenges in developing and deploying these models, and the role of technologists in ensuring responsible use of AI-generated content. Rohit discusses Segmind’s position in the AI ecosystem as a provider of APIs for various generative models, their focus on image manipulation, and their recent expansion into LLMs. He also touches on the potential future integration of generative AI with AR/VR technologies and the current limitations and advancements needed in this space.
AIM Media House: Segmind specializes in advancing generative AI across diverse media formats including images, videos, and audio. Can you share some of the latest advancements and unique challenges in developing AI for these formats compared to text-based models?
“Generally, the whole Generative AI chatter is about LLMs, but there’s a lot happening below the surface.”
Rohit Ramesh: Generally, the whole Generative AI chatter is about LLMs, but there’s a lot happening below the surface. Imaging is more challenging; the dimensionality of image data is much higher than for LLMs. And if you add the time component for videos, it’s even more complex. So the updates or new breakthroughs that are coming fall into two buckets: one is the open-source field, where you’re getting better visibility into different architectures and what’s working well for what, and then there are a lot of breakthroughs happening behind the curtains. Sora is a prime example of that. There are also breakthroughs on the 5-10 second video generation problem with different architectures, most of which we’re probably not aware of. That’s happening on the video side. Then there’s Stable Diffusion 3, a multimodal hybrid model that uses both Transformers and diffusion modules to generate better images. So a lot of activity is happening; it’s hard to sum it up in a couple of sentences. Having said that, I think we’ve seen the fidelity and the speed of generations increase 10x over the last year. I remember SD 1.5 and the other tools that generated videos back then; the tools that generate videos today are highly realistic and much faster. I think we’re getting to a state where fidelity is no longer a problem. One problem remains, and it’s something we’re solving at Segmind as well: control over the creation. Instead of a slot machine-based model where you say ‘here, generate this’ and something comes out of a random black box, people are working towards having more control: ‘take this video, edit this aspect of the video,’ which could be the background, the objects, and so forth. So there are a lot of updates, both architecturally and in the way people are using this for different applications.
AIM Media House: In the way people are using generative AI for different applications, we’ve seen significant advancements. Can you help our audience understand where Segmind stands in that ecosystem? Specifically, where does Segmind fit into the entire tech stack? Are you building on existing models or developing your own?
“We started with images, and we’re very strong on the image side. Any kind of image manipulation you want to do, you would find a model on Segmind for that.”
Rohit Ramesh: We started out with a compiler that sped up different ML models, and we’ve been making minor pivots since, but we have not created a foundation model. We’ve taken existing foundation models and made them either faster or better in some aspect. Our final layer on top, which people interact with, is APIs. We’ve taken most of the image-based models available, mostly in the text-to-image and image-to-image buckets, and provided them as APIs for developers to create their own applications. We’ve seen applications in fashion, gaming, marketing, and a few other areas use our APIs to develop their own solutions. Of late, we’ve also been adding a lot of LLMs, because we realized from our users that using image models in isolation makes it hard to get the desired output. The input to most of these models is a text prompt, which usually needs some manipulation before you send it to the image model. Hence, we’ve added LLMs as well. In a nutshell, we provide APIs to most of the generative models out there. We started with images, and we’re very strong on the image side. Any kind of image manipulation you want to do, you would find a model on Segmind for that. And now we’re adding LLMs as well.
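For readers new to the “models as APIs” pattern Rohit describes, here is a minimal sketch of what consuming a hosted text-to-image endpoint typically looks like. The URL, credential, request fields, and response shape below are placeholders of our own, not Segmind’s documented API; a provider’s docs would define the real schema.

```python
import base64
import os

import requests

# Placeholder endpoint, credential, fields, and response format; not any
# provider's actual API schema.
API_URL = "https://api.example.com/v1/text-to-image"
API_KEY = os.environ["IMAGE_API_KEY"]

payload = {
    "prompt": "product shot of a denim jacket on a mannequin, studio lighting",
    "negative_prompt": "blurry, low quality",
    "width": 1024,
    "height": 1024,
}

resp = requests.post(API_URL, json=payload,
                     headers={"x-api-key": API_KEY}, timeout=120)
resp.raise_for_status()

# Assuming the service returns a base64-encoded image; real services vary.
with open("jacket.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["image"]))
```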
AIM Media House: When it comes to generative AI for images or videos, you can’t ignore working with text as a format. Given where Segmind stands, can you help us understand the difficulties or intricacies you face when working with generative AI models for images? How does building these models differ from building LLMs, and what challenges do you encounter?
“I’m seeing a trend where image models are becoming more multimodal by themselves, improving the LLM aspects or the text understanding aspects to get better image generation.”
Rohit Ramesh: Image generation models are typically diffusion-heavy models, which are different from autoregressive models, which are transformer-based. There are a lot of major differences, and that’s why we’ve focused on imaging models for the last one and a half years, to make sure that we serve image models as efficiently as possible. This focus makes us one of the best infrastructure providers for image-based models.
Text is relatively simple and low-dimensional compared to image and video data. Images are typically tensors with height, width, and color channels. Adding one more dimension of time gives you video, and this increased complexity usually requires sophisticated architectures, much more complex than the Transformers or LLMs we see today.
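To make that dimensionality gap concrete, here is a small numpy sketch with illustrative sizes of our own choosing (a 77-token prompt, a 1024x1024 RGB image, a 5-second clip at 24 fps); the exact numbers are assumptions, but the orders of magnitude are the point.

```python
import numpy as np

# Text: a prompt is a short 1-D sequence of token ids.
tokens = np.zeros(77, dtype=np.int64)               # shape (77,)

# Image: a 3-D tensor of height x width x color channels.
image = np.zeros((1024, 1024, 3), dtype=np.uint8)   # ~3.1 million values

# Video: a time axis on top -> frames x height x width x channels
# (a 5-second clip at 24 fps = 120 frames).
video = np.zeros((120, 1024, 1024, 3), dtype=np.uint8)  # ~377 million values

for name, arr in [("text", tokens), ("image", image), ("video", video)]:
    print(f"{name:>5}: shape={arr.shape}, values={arr.size:,}")
```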
Having said that, the model size after training is usually much smaller on the image side. A typical diffusion model in the open space is between 1 GB and 8 GB in model weights, or up to around 8 billion parameters, which is the highest I’ve seen in the open space. For LLMs, 8 billion is the starting point; ChatGPT might be 800 billion parameters. The sizes of the models are very different, and hence, when you deploy these models in production, they behave differently. You can deploy four image generation models on a single GPU, whereas you might need eight to twelve GPUs to deploy a large LLM. The computational requirements are also very different.
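The deployment contrast Rohit draws can be roughed out with back-of-the-envelope memory math. The parameter counts, fp16 precision, and 80 GB of VRAM below are our own illustrative assumptions, and the estimate covers weights only; activations, text encoders, and batching shrink the real numbers.

```python
# Back-of-the-envelope memory math; all numbers are illustrative assumptions.
BYTES_PER_PARAM_FP16 = 2

def weight_memory_gb(params_billions: float) -> float:
    """Approximate fp16 weight footprint in GB (weights only)."""
    return params_billions * 1e9 * BYTES_PER_PARAM_FP16 / 1e9

diffusion_gb = weight_memory_gb(3.5)   # an SDXL-class model, roughly 3.5B params
llm_gb = weight_memory_gb(400)         # a hypothetical very large LLM
gpu_vram_gb = 80                       # one 80 GB data-center GPU

# Weights-only estimates; in practice activations, VAEs, and batching
# reduce how many copies actually fit per GPU.
print(f"Diffusion model: ~{diffusion_gb:.0f} GB of weights, "
      f"~{int(gpu_vram_gb // diffusion_gb)} copies fit per GPU")
print(f"Large LLM: ~{llm_gb:.0f} GB of weights, "
      f"needs ~{int(-(-llm_gb // gpu_vram_gb))} GPUs for weights alone")
```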
Text models are now trying to understand images, but they’re not good at generating them, or don’t generate them at all today. Image models are built for generating images, and to generate good, coherent images based on your prompt, they need to understand text really well. I’m seeing a trend where image models are becoming more multimodal by themselves, improving the LLM aspects, or the text-understanding aspects, to get better image generation.
AIM Media House: What are the implications of the differences between LLMs and image generation models for stakeholders such as developers, CDOs, and executive leadership? How do these differences affect investment and computational requirements? What should these stakeholders consider when working with generative AI for images, audio, and videos?
“As you go into this space over the next few years, you might see a shift, or another option becoming available even for text, beyond autoregressive generation.”
Rohit Ramesh: Although image models have fewer parameters, the complexity of the data brings the computational requirement for training almost on par with LLMs. There are different intricacies in training and deploying them, but I’ve seen that fine-tuning image models for your brand or objects is much more straightforward compared to LLMs, because they’re much smaller and require far less GPU capacity.
If you’re a CTO, the final requirement is defined by the problem statement. If you need to generate images, you’ll have to use image generation models. If it’s a text-based solution, you’ll look at LLM-based solutions. Not much difference in the sense that, as a CTO, everything is defined by the business end. You don’t need to bother too much because current GPU systems can handle fine-tuning or training of both types of models.
Going deeper, LLMs are basically just predicting the next token. The architecture works by predicting the next token, and the whole loop continues until you get the entire output. Diffusion works slightly differently, starting with a noisy canvas and reducing noise to get the final image output.
I’m seeing a lot of R&D in different companies trying to use diffusion to generate text like LLMs. The difference is LLMs are more like mimicking talking, predicting the next token and communicating. On the diffusion side, since you start with a noisy image and generate high, low, and mid-frequency aspects of the image before the final output, it’s almost like a thinking process. It’s like creating a pitch deck, where you first decide the titles of each slide: market size, problem statement, solution, go-to-market, and so on. Then under each topic, you go deeper. That’s how diffusion works. It thinks top-down, going from top-level topics to lower-level topics.
This is still in R&D. We haven’t seen a diffusion model that generates text at scale, but there’s a lot of R&D happening. As you go into this space over the next few years, you might see a shift, or another option becoming available even for text, beyond autoregressive generation.
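A schematic way to see the contrast Rohit draws is the shape of the two generation loops: autoregression grows a sequence one token at a time, while diffusion starts from noise and refines the whole canvas over many steps. The sketch below is purely illustrative; `llm`, `denoiser`, and `decode` are stand-ins for real models, and the update rule is a toy, not a real sampler.

```python
import torch

def autoregressive_generate(llm, prompt_ids, max_new_tokens=64):
    """LLM-style loop: predict the next token, append it, repeat."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = llm(ids)                            # scores over the vocabulary
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)      # sequence grows left to right
    return ids

def diffusion_generate(denoiser, decode, prompt_emb, steps=30, shape=(1, 4, 64, 64)):
    """Diffusion-style loop: start from noise and refine the whole canvas."""
    latent = torch.randn(shape)                      # pure noise over the full image
    for t in reversed(range(steps)):
        noise_pred = denoiser(latent, t, prompt_emb) # predict the noise at step t
        latent = latent - noise_pred / steps         # toy update; real samplers differ
    return decode(latent)                            # turn the latent into pixels
```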
AIM Media House: Who holds the responsibility for the usage of AI-generated video content, especially in high-risk areas like political speeches: technologists or policymakers? Can technologists mitigate misuse, and what measures can they take? Does the onus lie solely with policymakers, or also with technologists?
“There are two buckets: those who create the model can embed safety during training, and aggregators like us who curate different models and have a guardrail system to protect the community.”
Rohit Ramesh: Technologists are more in control of the situation, or have more power to control this narrative. Policymakers can only do so much: bring in a policy and say, ‘This is illegal’ or ‘This is legal.’ I’ve been in the drone space, especially in India, for the last 10 years, where technically drones are banned, but everyone uses them. Policymakers can set the guardrails, but to enforce those guardrails, technologies are needed. Whoever is creating these models needs to think about the safety of these models. That’s one of the reasons why many people are debating going closed source and making sure their models are not easily available for malicious use.
We are riding the wave of open-source models at Segmind. We take these open models, available in different forms in repositories or as big files, and create APIs for developers to consume. We need to work with those who created the model to understand if there’s something we can add to make the model safer. Then we add a layer or guardrail around the model, common across different models, for another layer of safety. This ensures users with malicious intent are either blocked, or at least a signature is added to the creation – video or image – that says, ‘This was created by this person at this time using this service.’
There are low-hanging solutions: we’re adding watermarks to all outputs that are hard to remove, digital signatures embedded in the latent space of the image, and metadata in the image file. It’s easy to remove metadata, but 90% of users don’t understand metadata and signatures and will be caught; most malicious users haven’t done the research needed to cover their tracks while spreading fake news.
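As a concrete illustration of the “low-hanging” provenance measures Rohit mentions, here is a minimal Pillow sketch that stamps generation metadata into a PNG. The field names and values are our own assumptions, and, as he notes, metadata alone is easy to strip, so it only complements latent-space signatures and visible watermarks.

```python
from datetime import datetime, timezone

from PIL import Image, PngImagePlugin

def tag_generated_image(user_id: str, service: str) -> PngImagePlugin.PngInfo:
    """Build PNG text metadata recording who generated the image, when, and where."""
    meta = PngImagePlugin.PngInfo()
    meta.add_text("generator", service)                 # illustrative field names
    meta.add_text("generated_by", user_id)
    meta.add_text("generated_at", datetime.now(timezone.utc).isoformat())
    return meta

image = Image.new("RGB", (512, 512))                    # stands in for a model output
meta = tag_generated_image(user_id="user-123", service="example-image-api")
image.save("output.png", pnginfo=meta)                  # metadata travels with the file
# Metadata is easy to strip, so in practice it is paired with latent-space
# signatures and hard-to-remove visible watermarks.
```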
Technologists have more to do in this space. Policymakers ensure the direction, but implementation falls to technologists. There are two buckets: those who create the model can embed safety during training, and aggregators like us who curate different models and have a guardrail system to protect the community.
A good anecdotal story is when we introduced a face-swap GIF model a year back. We thought people would use it for funny memes, but we saw many NSFW use cases. We immediately took down the playground page, which is a simple GUI for the whole process, and created an onboarding process where users submit why they want to use the model. Only when we validate a legitimate use case do we open that API for them. But it’s an interesting perspective that the onus lies heavily on technologists as well.
AIM Media House: How has the evolution of data analytics from traditional BI to advanced analytics and machine learning informed the scaling of Generative AI applications? Specifically, what components and considerations are necessary for scaling Generative AI for videos to ensure they are effectively used at scale and not siloed?
“If you use DALL-E today, you’ll see that on average it takes 10 to 20 seconds to generate an image, and that still holds back the usage of these models at scale.”
Rohit Ramesh: We see two areas that still need to improve before we see widespread adoption. The first one is control. What I mean by control, for both images and videos, is that today it’s more of a slot machine-based system. You put in a prompt, you get four images. Ideally, you should be able to say, ‘I like this, but I don’t like this aspect,’ and have it use that as a starting point and nudge the output in the direction of the right image or video; today, you’re not able to control it precisely. You should be able to say, ‘Create a photo of XYZ walking on a jogging track,’ then say, ‘I like this photo, but change the color of his pants to denim,’ and go step by step to get to your end image. Rather than leaving it to luck or chance, generating four images, looking at them, then generating four more, control needs to be built into the model or the overall system. That way, creators or consumers of these models aren’t playing a guessing game of maybe getting a good output or retrying until they do. That’s a big blocker, and that’s what Segmind is working on. I’ll mention our tool called Pixel Flow here, where we’ve built a simple, no-code interface where you can create something and edit it in a direction that gives you the final output you need, instead of relying on luck.
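A rough sketch of the step-by-step control flow Rohit describes, as opposed to re-rolling the slot machine, might look like the session object below. The helpers `generate`, `recolor`, and `edit_region` are hypothetical and are not Pixel Flow’s actual interface; the point is only that each edit builds on the previous state and leaves an auditable history.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class EditSession:
    """Holds the current image plus an auditable history of edits."""
    image: Any                                   # current image artifact
    history: List[str] = field(default_factory=list)

    def apply(self, step_name: str, fn: Callable, **kwargs) -> "EditSession":
        """Apply one targeted edit and record it, instead of regenerating from scratch."""
        self.image = fn(self.image, **kwargs)
        self.history.append(step_name)
        return self

# Intended usage, with hypothetical helpers assumed to exist:
# session = EditSession(image=generate("XYZ walking on a jogging track"))
# session.apply("recolor pants to denim", recolor, target="pants", color="denim")
# session.apply("swap background", edit_region, region="background", prompt="park at dusk")
# print(session.history)   # every step is recorded, nothing is left to chance
```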
The second important blocker today is computation, both in terms of the computational requirement and the time it takes to create, say, a video – at least four to five minutes. Creatives want it much faster. When we came into this space, we came in with a library called VoltaML, which took image generation on Stable Diffusion models from 10 seconds to almost a second. We went down to almost 20 frames per second on leading GPUs. That changed the game – you could literally type and get outputs in real time. The possibilities that opened up thrilled creators. The same thing is lacking today in videos, and even most image models aren’t real time; most take five to 10 seconds. If you use DALL-E today, you’ll see that on average it takes 10 to 20 seconds to generate an image, and that still holds back the usage of these models at scale. These are the two important factors I feel will improve over the next few years.
AIM Media House: What advancements are needed for Generative AI applications to integrate with technologies like AR/VR in the future? Considering the current limitations with AR/VR device costs, if these become more affordable, how might these technologies merge? What other factors and trends are crucial for further evolution in this space?
“I see a world where, 10 years from now, you could be having games where your environment is completely generated according to your scenarios, while my environment is generated based on my scenarios”
Rohit Ramesh: Eventually, I’m sure generative AI will make its way into AR/VR, but I think they’re two different new-gen technologies trying to evolve on their own. They will meet somewhere. A lot of use cases are already being powered by Gen AI models on AR/VR use cases. But since both are new technologies, they need to evolve independently before one uses the other. Having said that, I’ve seen many use cases where people are generating real-time environments, characters, virtual characters, user interfaces, using generative AI models. Today, it’s usually a setup where these models are created beforehand on a powerful machine, and then the real-time environment generator creates the environment, which is eventually used in an AR/VR setup. Once these models are nimbler and smaller, and the processing power on these headset devices improves, you’ll see these models directly deployed on the devices, creating these environments in real-time. I see a world where, 10 years from now, you could be having games where your environment is completely generated according to your scenarios, while my environment is generated based on my scenarios. I can generate my own characters and so forth. But I think there’s still some more time before we see that.