From Precision to Efficiency: A New Perspective on Data Engineering

At the recent Data Engineering Summit, Sudarshan Pakrashi advocated for a groundbreaking approach to managing big data. His proposal of using probabilistic data structures could redefine industry norms, adding a fresh perspective to the discourse on data processing and storage.

In a captivating talk at the summit, Sudarshan Pakrashi, Director of Data Engineering at Zeotap, addressed the elephant in the room for many data-driven organizations: the rising cost of managing and processing ever-expanding data sets. His proposed solution was both simple and profound, reshaping how we think about handling voluminous data.

Painting the Data Landscape

Pakrashi painted a vivid picture of the challenge, demonstrating its magnitude with a relatable example. Suppose an organization like Zeotap tracks a million users’ impressions across 100,000 ads daily, with each user assigned a unique 64-bit hash key. The amassed data would be around 160 GB daily, ballooning to roughly 5 terabytes monthly.
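The raw-volume arithmetic behind figures like these is easy to sketch. The snippet below is purely illustrative; the per-record size and the daily event volume are assumptions chosen to reproduce the roughly 160 GB/day figure from the talk.

```python
# Back-of-envelope for raw impression logs.
# Assumptions (not from the talk's slides): each impression record is
# reduced to its 8-byte (64-bit) hash key, and the daily event volume
# is picked so the total matches the ~160 GB/day figure cited above.
KEY_BYTES = 8
daily_events = 20_000_000_000   # hypothetical: ~20 billion impressions/day

daily_gb = daily_events * KEY_BYTES / 1e9
print(f"{daily_gb:.0f} GB/day")  # 160 GB/day
```

Even under these conservative assumptions, a month of raw keys runs into terabytes before any per-event metadata is counted.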

This scenario represented a single use case; real deployments involve several analytic needs and a consistently expanding data repository. The underlying problem is that storage and computation costs grow in proportion to the raw data size, placing a mounting financial burden on organizations.

Challenging the Convention

The novel approach Pakrashi proposed was based on challenging the need for absolute accuracy. In many operational contexts, like real-time alert systems or reporting dashboards, exact numbers aren’t necessary. Instead, a broad understanding of patterns and trends is often sufficient. This paradigm shift opened the door to new solutions that favor speed and scale over absolute precision.

Probabilistic Data Structures – The Solution

One such unconventional solution Pakrashi discussed was probabilistic data structures, specifically the Count-Min sketch. A Count-Min sketch estimates the frequency of elements in a data stream (impressions per user, in this case). It uses hashing to produce efficient approximate counts, trading a small amount of precision for dramatically reduced computation and storage needs.
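The talk did not include code, but a minimal Count-Min sketch can be written in a few lines of Python. The class below is an illustrative sketch, not Zeotap's implementation; the default width and depth are arbitrary choices for demonstration.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: estimates never undercount,
    and overcount only with bounded probability."""

    def __init__(self, width=2000, depth=5):
        self.width = width    # counters per row (controls error size)
        self.depth = depth    # independent hash rows (controls confidence)
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash per row by salting a single digest with the row id.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so every row is an
        # overestimate; the minimum across rows is the tightest one.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for user_id in ["u1"] * 100 + ["u2"] * 3:
    cms.add(user_id)
print(cms.estimate("u1"))  # never below the true count of 100
```

Note the asymmetry: an estimate can only err upward, which is exactly the property the error analysis below relies on.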

Count-Min Sketch and Markov’s Inequality

Count-Min sketches derive their guarantees from a statistical result called Markov’s inequality, which states that a non-negative random variable exceeds any multiple of its mean with probability at most the reciprocal of that multiple. By treating a counter’s excess (the collision error) as such a variable and choosing a suitable multiple of its expected value, an upper limit can be placed on the probability of a large error.

Applying these statistical principles, Pakrashi explained that the probability of a counter’s error exceeding twice its expected value is at most 50%. Equivalently, each row’s estimate stays within twice the expected error with probability at least 50%; and because the sketch takes the minimum across several independently hashed rows, the chance that every row overshoots at once shrinks exponentially with the number of rows.
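The bound is easy to check empirically. The snippet below draws a skewed, non-negative sample (an exponential distribution, chosen purely for illustration) and confirms that the fraction of values above twice the mean never exceeds Markov’s 50% ceiling.

```python
import random

random.seed(42)
# Illustrative stand-in for a non-negative error variable: 100,000 draws
# from an exponential distribution with mean 1.
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)

frac_above = sum(s > 2 * mean for s in samples) / len(samples)
# Markov's inequality guarantees this fraction is at most 0.5; for this
# particular distribution it is actually near exp(-2), about 0.135.
assert frac_above <= 0.5
print(frac_above)
```

Markov’s bound is deliberately loose: it holds for any non-negative distribution, which is what lets the sketch make guarantees without knowing anything about the data.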

Quantifying the Error

Applying this theory in practice means fixing an acceptable error limit, such as 0.1% of the total event count. From that margin, the sketch’s width and depth can be chosen so that its estimates fall within the margin at a chosen confidence level. The result is an efficient system providing near-real-time analytics within an acceptable error margin, at a fraction of the storage and computation costs.
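The standard Count-Min sizing formulas make this concrete: a width of ⌈e/ε⌉ and a depth of ⌈ln(1/δ)⌉ guarantee that an estimate overshoots the true count by more than ε times the stream size with probability at most δ. The helper below (the function name is ours, not from the talk) shows the footprint for a 0.1% error margin at 99% confidence.

```python
import math

def cms_dimensions(epsilon, delta):
    """Standard Count-Min sizing: with probability at least 1 - delta,
    an estimate exceeds the true count by at most epsilon * N, where
    N is the total number of events in the stream."""
    width = math.ceil(math.e / epsilon)     # counters per row
    depth = math.ceil(math.log(1 / delta))  # number of hash rows
    return width, depth

width, depth = cms_dimensions(epsilon=0.001, delta=0.01)
print(width, depth)  # 2719 counters per row, 5 rows
```

That is about 13,600 eight-byte counters, roughly 0.1 MB, regardless of how many events stream through, compared with the hundreds of gigabytes of raw keys discussed above.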

The Impact

Pakrashi concluded his talk by sharing the transformational impact this approach had at Zeotap. Using probabilistic data structures, they managed to optimize efficiency and improve business outcomes significantly. The methodology promised to be a valuable addition to any data engineer’s toolbox, offering an innovative approach to handling and analyzing large datasets.


In an era where data continues to grow exponentially, Pakrashi’s talk at the Data Engineering Summit shed light on an inventive, cost-efficient strategy for managing large datasets. His insights provided food for thought for data engineers and leaders as they navigate the dynamic landscape of data management and analytics. The implementation of such cutting-edge methodologies promises to address pressing challenges and propel the industry forward.
