AI Innovation in a Startup vs a Large Enterprise With Sushant Hiray

By Anshika Mathews
Published on March 11, 2024

CDO Insights

AI innovation manifests differently in startups versus large enterprises, reflecting their unique organizational dynamics and strategic priorities. Startups prioritize agility and disruptive solutions, leveraging AI to streamline operations and drive rapid growth. In contrast, large enterprises emphasize scalability and integration, harnessing AI to enhance existing processes and maintain market dominance. Understanding these distinctions is crucial for navigating the AI landscape effectively in both startup and enterprise contexts.

To give us better insights on this, for this week’s CDO Insights we have Sushant Hiray, a seasoned AI researcher and data science leader who specializes in architecting cutting-edge solutions. As the Senior Director of Machine Learning at RingCentral, he drives the development of the next-gen Conversational Intelligence Platform. Previously, Sushant co-founded DeepAffects, later acquired by RingCentral, and led data engineering at Lumiata. He holds a Bachelor’s in Computer Science and Engineering from IIT Bombay, with notable scholarships, and actively contributes to open source projects, having been both a Google Summer of Code Student and Mentor.

In the interview, Sushant will discuss the journey of DeepAffects, including its inception, expansion to India, and acquisition by RingCentral. He’ll explore factors driving startup founders to consider acquisition and share insights on collaborating with top executives in Silicon Valley. Sushant will compare challenges faced by startups and scaling companies in AI, address tactical versus strategic aspects of entrepreneurship, and provide advice on scaling AI products. Additionally, he’ll touch on staying updated with technology trends and the future of emotion recognition technology, including its ethical and cultural considerations.

AIM: What was the idea behind DeepAffects and could you walk us through its journey from inception, including the decision-making process involved in starting it, expanding operations to India, and ultimately being acquired by RingCentral?

Sushant Hiray: I worked for a startup, graduated from IIT Bombay, and was always interested in taking the path people typically did not take. Do you want to join Google or try something new? What’s the worst that can happen? That was the idea when I met my other co-founder. He was the founder of a company called Lumiata. We were into predictive healthcare analytics. There, I essentially built an AI organization where we focused on analyzing massive amounts of US healthcare data. From a technology standpoint, healthcare is super exciting. But we also faced many interesting challenges regarding the lack of adoption from bigger healthcare players like hospitals and insurance providers. The sales cycles were too long, and it was not super exciting for us. So we decided that we wanted to try something else. That’s when we quit Lumiata and decided to focus on something else.

Back in Lumiata, we also had a lot of interesting contacts in the healthcare industry. We were working with a few universities and trying to help them with this problem of predicting depression among college students. Most students don’t seek help. Are there ways to identify markers for depression within your voice when people are calling? When we started working on that problem, we realized that many foundational aspects of AI are missing when we want to process speech. Back then, the only technology available for the masses was speech-to-text. But if you wanted to analyze different markers within speech, those were missing, so we dived deeper into it and then realized that there is a foundational layer which needs to happen beyond speech-to-text, which can help us analyze a lot of this data.

The first problem we faced was that the recordings we had were single-channel, so if there was a recording with two speakers, we didn’t know who was speaking when, and hence we had to split the recording. The original thought was that this should be a fairly straightforward problem. I was sure somebody had already solved it. Once we started diving deeper into it, we realized there was no solution beyond academia for this particular aspect of the problem. Academic solutions were all based on academic datasets. They did not scale well to the real-world challenges of speech. Noisy data and lack of good quality data were the key challenges. So, that’s when we started focusing on Speaker Diarization. It’s a fairly academic term, but it essentially means you’re splitting the audio into “who spoke when”.

We had a website and were dabbling with creating different APIs, so we just added the Speaker Diarization API to it. It was pretty interesting that there were a lot of hits to that particular page. We realized that if somebody comes to our website for this particular search string, it means that they know what they’re looking for. That’s when we found our fit for the market: hey, there are people out there looking for this as a service. That’s where we started doing all of the engineering and worked on how to solve this problem well. Not just for two people, but can we do it for video meetings? Can we solve it for four or eight people in a meeting, or much bigger? That’s when we realized there is a large space in business communications where you want to analyze what people are talking about. And there has been a fundamental challenge in how you could analyze this data. We created this foundational layer of not just what was being spoken but who spoke what, when, and how: how they spoke with emotion, and how they expressed those things.

And so we created this whole suite of APIs for telling you anything and everything you would need from a given meeting recording or audio recording. And we had very interesting use cases. Podcasts were also getting pretty active back then. People loved podcasts, but the transcripts were inaccurate, and you wanted to know who spoke when and what in the podcast. Contact centers were another massive use case for us, to understand what the customer or the agent was saying. And so that created the whole perfect story.
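To make “who spoke when” concrete, here is a minimal sketch of what a diarization result might look like once the audio has been split, and how it could be summarized. It is not the DeepAffects API: the `Segment` type, the helper names, the 0.5-second gap threshold, and the sample timings are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized span (hypothetical shape): speaker label plus times in seconds."""
    speaker: str
    start: float
    end: float


def merge_adjacent(segments, max_gap=0.5):
    """Merge consecutive segments from the same speaker when the pause
    between them is at most max_gap seconds (an assumed threshold)."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Extend the previous turn instead of starting a new one.
            merged[-1] = Segment(last.speaker, last.start, max(last.end, seg.end))
        else:
            merged.append(seg)
    return merged


def talk_time(segments):
    """Total seconds attributed to each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Made-up output for a single-channel, two-speaker recording.
segments = [
    Segment("speaker_1", 0.0, 4.0),
    Segment("speaker_1", 4.25, 6.0),
    Segment("speaker_2", 6.5, 10.0),
    Segment("speaker_1", 10.0, 12.0),
]
turns = merge_adjacent(segments)
totals = talk_time(turns)
```

With this sample input, the first two `speaker_1` segments collapse into one turn, leaving three turns in total. Producing the segments from raw audio is the hard part the interview describes; this sketch only covers what can be done once they exist.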

AIM: When does a startup founder decide to get acquired, and what are some of the factors that make sense when making this decision? Conversely, when should a startup founder decide not to get acquired?

Sushant Hiray: For us, it was pretty interesting. In hindsight, we were basically at the right place at the right time. COVID had just happened, and suddenly the technology that we were working on had massive use cases. UCaaS companies like RingCentral and Zoom had a phenomenal set of new users coming in because businesses needed to talk to each other. RingCentral was a customer of ours back then, when they had just launched their video product and were trying to do some AI along with that. Since then, we had also built a lot more on our API stack. We dabbled with summarization of meetings back in 2017-18. The Transformer era was interesting and things were happening, but this was pre-LLM, so there was not just one model which could solve all the problems. So we did a lot of cool engineering to try and summarize meetings. For us, it was an interesting mix: we had a very interesting technology, but as a startup, we knew that we needed this technology to reach the masses. We needed really great distribution, which we did not have. And so that was one of our key decision-making factors: do we stay as an API company where we focus on different use cases, or do we sit together with a communications platform and become a native part of it, with the potential of completely revamping how communication happens? RingCentral was already using our APIs, and they were also interested in a long-term collaboration where we build something custom for them, which eventually panned out to us getting an acquisition offer.

AIM: How can aspiring entrepreneurs in institutions like IIT Bombay or colleges start reaching out and working with top executives in Silicon Valley to gain insights and collaborate on AI projects, especially when aiming to understand problem conceptualization, find scalable solutions, and engage with potential customers effectively?

Sushant Hiray: It’s a pretty interesting problem. The good thing is that, because of the AI hype, a lot of enterprises today are more approachable. If you have an interesting enough technology, you might have some doors open, even if they might not want to buy it outright. Cold emails definitely work. For us, what helped was that they were looking for something which we were offering and nobody else was offering, which was unique in our case. That is not necessarily what always happens. So it’s a mix of two things. One is that you do have to reach out to a lot of places. For example, if you’re from IIT Bombay, your network is amazing. The IIT Bombay community here in the Bay Area is phenomenal. In every single company that you can think of, you can find an executive or some contact within the network. That’s one of the best ways to get those intros going, and in general people are super helpful. Alumni networks definitely help open some doors if you’re right out of college and don’t have your own network. If you have some experience, or maybe one of the founders has larger industry experience, they can use their own social contacts to open up some of the doors. But cold calling, or rather cold emailing, or just really creative LinkedIn InMails also work. We got a lot of our customers by doing a lot of these things that don’t scale well, but for your initial set of customers, you want to be creative in terms of knowing who your end user is, and you want to target them.

AIM: What are the key differences in the set of challenges faced by a startup founder building an AI product from scratch with limited resources compared to a company like RingCentral that is in the scaling phase?

Sushant Hiray: I would approach it in two categories: one is money. As a startup, depending on how much money you’ve raised, you may or may not have much money. You may or may not have access to GPUs, which limits what you can do as a potential AI company. But then again, there are many creative ways to get GPU access: Google and AWS credits give you enough to shuttle around. We were still doing a lot of DeepTech, where we needed to train our models, and we did lots of creative stuff. We started on GCP, migrated all our workloads to AWS when we got AWS credits, and migrated all the workloads to Azure when we got Azure credits, so we just built this whole stack, which was easy to migrate on and off clouds. For us, it was just a matter of having credits. So that is one way: because you have limited capital, you need to be creative in training your models.

There are still some changes with LLMs; you may or may not need to train your models, and that bottleneck has gone away. But it also means you can no longer work on superficial items because it’s easy for somebody else to do the same thing. So, you need to dive deeper into the problem statement to determine precisely where you add the correct value as a company. That is one aspect of it.

Engineering involves the aspect of how you do this well. The other aspect is data. How do you get data? Where do you get the data from? You cannot purchase data if you don’t have enough money. How do you get the good quality data that you care about? We used to generate synthetic data. Now, we are at an inflection point where generated data itself is becoming a category. People are creating superficial data, which is good enough. So those were the key challenges which we had.

Talent is the biggest one, and how you convince somebody to join a company. As a founder, especially if you’re looking for a great engineer, the biggest skill you need to learn is how to sell, and even if it’s a technical co-founder, you need to sell it to your employees. You need to sell it to the corresponding engineers.

It’s the art of selling, which people don’t know, and you learn it on the job. You get into it and then slowly figure out how the pitch you will give to a VC is entirely different from the pitch you will give to a prospective employee, and you need to craft it in a way that matches what the person on the other end is looking for.

In contrast, we have entirely different challenges at the RingCentral scale. We start with legal reviews. We are a global company, so you have GDPR, data residency rules, and data retention policies. You cannot build an app in a silo and hope it works, because you hold people’s trust. Even within RingCentral, trust is one of the foundational principles, because people use RingCentral phones as their primary line. It’s the single point of failure for them. If the phone goes off, business stops. For us, the questions are: how do you treat this data carefully? How do we ensure the customer is okay with us running AI models? They need to be aware that nobody has access to this data. How do we create this whole pipeline and make sure it scales well? As a startup, you don’t necessarily need to worry about 1 million concurrent requests. You may not even reach that level in the initial stage. But at the RingCentral scale, we process a few million voicemails daily. The volume of calls that go through is significantly high, which helps us realize that many decisions you might have made in the early days as a startup do not necessarily make sense at this scale. They won’t necessarily scale up.

AIM: As an entrepreneur building AI solutions, how do the tactical aspects of work, such as coding and selling ideas, differ from the strategic aspects, like driving frameworks and architecting solutions, both when starting off and in the scaling phase of the business?

Sushant Hiray: Many people are deeply immersed in AI, and one of the common questions we often receive is about the rapid pace at which AI is evolving. Every day, new sets of papers emerge. How do you keep up? If your day job involves deep technical work and reading papers, you might still manage to cover some, but not all. In my current position, I barely have time to read all those papers. So, one of the key challenges is staying up-to-date with industry developments.

With all the hype surrounding AI research, there’s indeed a lot of amazing work happening, but a significant portion is also incremental. I have focused a lot on figuring out those inflection points. At what point is something critically different than what existed before which warrants us to invest in doing this? These are the areas I try to focus on.

There are a lot of details where you rely on your team to try and figure out if the marginal 0.1% matters much to us. In some cases, it does; in others, it does not. So you make some of those decisions.

For me, it’s about two things: staying abreast of what’s happening and making them aware of these technological shifts, and positioning ourselves as a leader in the industry to capitalize on the new wave of AI.

AIM: As a leader in the field of AI entrepreneurship, how do you ensure you stay informed about the latest technologies and advancements, especially when transitioning from more technical roles like coding and data science to strategic leadership positions? Do you rely solely on hiring experts or do you actively engage in staying updated with the latest trends and developments?

Sushant Hiray: I really would like to hire people to keep up with it. Unfortunately, it’s not as easy as that. But that is one part: you rely on your team to propagate some of this information. Take LLMs as an example: RAG techniques have become popular, and I have not tried any of them. I can dabble around with it. But that’s where you lean on your team to run a bunch of these experiments and figure out what works and what does not, what makes sense and what does not, and why. And that’s where your team composition also matters a lot.

And the second part of this is also your peer network. What helps a lot is that even if I’m not personally building some of this, still, my peer network comprises academics who are now professors and are now doing some of the state-of-the-art research. Being in touch with these people, being surrounded by them, and bouncing off ideas helps you understand where the landscape is shifting and where things are headed. There is a lot of noise, and you’re just looking for the signal within this noise today. A lot of it is also calibration for myself. I have also been calibrating my models to look out for signals from this myriad of noise that’s happening. But I’ve become much better over the years at picking up signals. Part of it is being thankful to my great team, who can distill some of this information.

AIM: How do you assess the potential of an AI idea for scalability and commercialization, and what indicators suggest that an idea may not be viable? What advice do you offer to entrepreneurs regarding the decision-making process for scaling AI products?

Sushant Hiray: As engineers, we tend to fall into the fallacy of, this is an amazing technology and I need to figure out what to do with it. It’s really hard to get away from that mindset. But once you start looking at a problem purely on its merit, it does not matter whether you need AI for it or not. It could just be a small form that solves the problem; what matters is that a problem does exist. Are people willing to pay for a solution to it? And this is where you refine your art of figuring out whether people, when they say they’re willing to pay for it, will actually pay for it. That’s entirely different, because in a normal conversation you could come up with ten points, four or five of which are things that we notice and could do better as a quality-of-life improvement. Now, are you willing to pay for that? That would be the question. In the grander scheme of things, everybody wants to buy things which make their life easier, but again, the question becomes: is this a nice-to-have, or is it a must-have?

The customer may or may not always tell you that honestly. And it’s up to you to figure out what those signals are to judge whether this makes sense or not. Once you’ve zoned in on the problem, how you devise a solution for it becomes the secondary aspect. Even for us, when we were doing DeepAffects, we stumbled upon a problem and realized it was a much bigger problem than we thought it was, which may not always happen. But in general, it’s about looking for those things. As engineers, our workflows are so common, and we do a lot of things. Many of these are annoying, repetitive tasks that we might not change, and many people are trying to work on them. But in the end, one of the things that you need to look for is that for any business to be successful, it needs to make money. If you cannot make money, no matter how much money you raise, and you just delay the decision of how to make money from this, in the end it might not work. Your chance of success exponentially increases if you can make some money out of it. And then, whether this is a problem for VC scale or a problem for a lifestyle business, both of them are great choices. It just depends on how you want to approach them. There are many problems where, even with huge amounts of VC capital, you cannot scale, and you have to shut it down. But many of those problems could be good as lifestyle businesses: it cannot grow 1,000,000x, but it is a good problem for 100x, and there are enough people who can build this or can buy this.

AIM: How do you handle the complexities of emotion recognition through AI, given the ethical and cultural considerations involved? What steps are you taking to ensure inclusivity and accuracy while collaborating with legal teams? Additionally, what is your perspective on the future of emotion recognition technology?

Sushant Hiray: Historically, I used to look at emotion recognition as a problem to be solved from a data science standpoint: I have a signal, and I figure out what the question is. Since then, I’ve changed my approach a little bit.

When a customer says they want to detect emotion, what exactly are they looking for? Are they only interested in looking at instances when a customer is angry at the contact center? Are they looking for some other instances?

And we’ve worked a lot with emotion detection. We’ve published a few papers over the years and have had a couple of patents filed in the same area. But it’s not accurate enough to fundamentally shift things. There are different theories of emotion recognition. Some say there are six different emotions. Some say there are eight. Some say forget about all of this, let’s just do positive or negative sentiment and be done with it. In general, rather than trying to fixate on the emotional aspect of it, we are focusing on the outcome: what is the end goal and how can we reach it, because this is just a tool, this is just technology. However, there are significant challenges in terms of validating these models across a lot of cohorts.

The cohorts could be just male versus female; the cohorts could become much more complicated regarding different races. The cohorts could be complicated in terms of just how loud your voice is or what its pitch is, which slightly correlates with your gender. There are so many of these cohorts, and academia is now getting there in terms of how to standardize all of this. Can they get to a point where they have solid benchmarks, not just for emotion recognition but for a lot of AI models, to identify what is bias within the model? And for us as a company, bias is one of the key aspects, and fortunately, a lot of European legislators have already started thinking about this: what is the outcome? Is the output of this model impacting somebody’s livelihood? If so, you need to do a lot of scrutiny. For example, if the end goal is just summarizing a meeting, it won’t change anybody’s outcome of the day. It depends on what the result is and what you are using the result for. If you’re a recruiting company and you have an AI model that parses resumes and says yes or no, that has a lot of pitfalls in terms of whether you have done enough testing across different types of applicants.
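One way to picture validating a model across cohorts is to compute the same metric separately for each cohort and look at the spread. The sketch below assumes a simple accuracy metric and made-up labels and predictions; the cohort names, the function names, and the choice of accuracy as the metric are illustrative assumptions, not how RingCentral evaluates its models.

```python
from collections import defaultdict


def accuracy_by_cohort(records):
    """Compute accuracy per cohort.

    records: iterable of (cohort, true_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for cohort, truth, predicted in records:
        total[cohort] += 1
        correct[cohort] += int(truth == predicted)
    return {cohort: correct[cohort] / total[cohort] for cohort in total}


def max_accuracy_gap(per_cohort):
    """Difference between the best and worst cohort accuracy."""
    values = list(per_cohort.values())
    return max(values) - min(values)


# Made-up evaluation records for two hypothetical cohorts.
records = [
    ("cohort_a", "angry", "angry"),
    ("cohort_a", "neutral", "neutral"),
    ("cohort_a", "angry", "angry"),
    ("cohort_a", "neutral", "angry"),
    ("cohort_b", "angry", "neutral"),
    ("cohort_b", "neutral", "neutral"),
    ("cohort_b", "angry", "neutral"),
    ("cohort_b", "neutral", "neutral"),
]
per_cohort = accuracy_by_cohort(records)
gap = max_accuracy_gap(per_cohort)
```

In this toy data the model misses every angry example in `cohort_b`, so the two cohorts differ in accuracy even though an overall average would hide that. A large gap is a signal to investigate further, not proof of bias by itself, and real evaluations would need far more data per cohort.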

We categorize every single model: this is the model, this is its purpose, and this is the end goal. And we outline the entire journey of the data lifecycle for the customers.

We don’t have a social scientist on my team, but RingCentral does hire a lot of consultants who analyze these outcomes for us. And not just for AI in particular: because we are a phone and communications company, we do a lot of analysis regarding the accessibility of the phone technology and of video meetings, and what the social impact is. RingCentral also does a lot of interesting work in this area.

Anshika Mathews

Anshika is the Senior Content Strategist for AIM Research. She holds a keen interest in technology and related policy-making and its impact on society. She can be reached at anshika.mathews@aimresearch.co
