As the keynote speaker at the Data Engineering Summit 2023, Rittika Jindal, a seasoned Cloud Architect at Thomson Reuters, walked the audience through the development of a dynamic, autonomous data transformation pipeline. The tech stack comprised AWS Fargate containers, DBT, and Airflow, focusing on the ‘T’ in the ETL or ELT pipeline. The pipeline equips business teams to build their own data marts using SQL and DBT and to create dashboards for rapid insights, without depending heavily on data engineers.
The Nature of the Transformation Pipeline:
Jindal emphasized the concept of a transformation pipeline, distinguishing it from conventional data extraction and loading pipelines. A transformation pipeline primarily targets the ‘T’ in ETL, assuming the data is already within the system and focusing solely on transformation. Thomson Reuters uses this pipeline to drive autonomy in their data analytics, incorporating technologies like DBT, Airflow, and AWS Fargate containers.
Promoting Autonomy in Data Analytics:
The primary goal of this transformation pipeline is to foster autonomy within Thomson Reuters’ data analytics team. Jindal explained that multiple teams operate in their data domain: the data engineering team, which builds data pipelines and maintains years of accumulated data, and the analytics team, which works closely with business units to produce insights, models, and dashboards. To bridge the gap between raw data and data that analysts and data scientists can interpret, Thomson Reuters devised the data transformation pipeline. It empowers the analytics team to transform data and build their own data marts using simple SQL queries.
Choosing the Right Tools: DBT and Airflow
Once the team had settled on the idea of a data transformation pipeline, the next challenge was choosing the right tools. After weighing factors including licensing costs, upskilling requirements, and ease of use, DBT emerged as the best fit for data wrangling and transformation. Open-source and user-friendly, DBT focuses solely on transforming data that is already in the system.
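The talk itself did not include code, but a minimal sketch of kicking off a single DBT run from Python might look like the following. The model name `customer_orders` is purely illustrative, and a configured DBT project and profile are assumed.

```python
import subprocess

# Illustrative sketch: invoke DBT's CLI from Python to build one model.
# Assumes dbt is installed and a project/profile are configured; the
# model name "customer_orders" is a placeholder, not from the talk.
result = subprocess.run(
    ["dbt", "run", "--select", "customer_orders"],
    capture_output=True,
    text=True,
)

print(result.stdout)
result.check_returncode()  # raise if the dbt run failed
```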
For workflow orchestration, Jindal and her team considered various tools before settling on Airflow. They needed a tool that could not only orchestrate DBT tasks but also run arbitrary Python code and make third-party API calls. Airflow, a platform for scripting workflows in Python, complemented DBT well and met those requirements.
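To make that concrete, here is a minimal sketch (not from the talk) of an Airflow DAG that pairs a DBT task with a Python step calling a third-party API. The DAG id, model name, and URL are placeholders, and Airflow 2.4+ with dbt installed on the worker is assumed.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def notify_downstream_service() -> None:
    # Illustrative third-party API call; the URL is a placeholder.
    requests.post("https://example.com/api/refresh", timeout=30)


with DAG(
    dag_id="dbt_transformation_pipeline",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Run a single DBT model on the worker; assumes dbt is installed
    # and the project/profile are configured.
    run_model = BashOperator(
        task_id="run_customer_orders_model",
        bash_command="dbt run --select customer_orders",
    )

    # Arbitrary Python step, e.g. notifying a downstream service.
    notify = PythonOperator(
        task_id="notify_downstream_service",
        python_callable=notify_downstream_service,
    )

    run_model >> notify
```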
Implementing the Solution:
The final step was to integrate everything into a seamless architecture that adhered to the principles of scalability and high availability. Thomson Reuters prioritizes building solutions in the cloud and prefers serverless designs, which led to a solution architecture using Amazon ECR to store Docker images and AWS Fargate to run serverless containers.
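As an illustration of how such an architecture might be wired up, the boto3 sketch below registers a Fargate-compatible ECS task definition pointing at a DBT image in ECR. Every identifier here (family, image URI, role ARN, CPU/memory sizes) is a placeholder rather than a detail from the talk.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical sketch: register a Fargate-compatible task definition
# that points at the DBT image stored in Amazon ECR. All identifiers
# below are placeholders.
ecs.register_task_definition(
    family="dbt-transformation",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",  # required for Fargate tasks
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "dbt",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dbt-runner:latest",
            "essential": True,
        }
    ],
)
```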
The architecture works as follows: a Docker image containing DBT and Python is stored in Amazon ECR. From that image, Amazon ECS task definitions are created, which launch Fargate containers. Each container starts from the DBT image, runs a single model, and then shuts down. The entire process is orchestrated through Airflow, making the system efficient, scalable, and high-performing.
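A hedged sketch of that orchestration pattern, using the Amazon provider’s EcsRunTaskOperator to launch one Fargate container per model: the cluster, subnet, and model names are illustrative, and a recent apache-airflow-providers-amazon package is assumed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

# Illustrative model list; the real models were not named in the talk.
MODELS = ["customer_orders", "daily_revenue"]

with DAG(
    dag_id="dbt_fargate_pipeline",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for model in MODELS:
        # One Fargate task per model: the container starts from the DBT
        # image in ECR, runs a single model, and shuts down when done.
        EcsRunTaskOperator(
            task_id=f"run_{model}",
            cluster="dbt-cluster",                 # placeholder cluster name
            task_definition="dbt-transformation",  # registered earlier
            launch_type="FARGATE",
            overrides={
                "containerOverrides": [
                    {
                        "name": "dbt",
                        "command": ["dbt", "run", "--select", model],
                    }
                ]
            },
            network_configuration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                    "assignPublicIp": "ENABLED",
                }
            },
        )
```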
Conclusion:
Rittika Jindal’s talk shed light on an innovative approach to data transformation, emphasizing the importance of autonomy in data analytics and of intelligent tool selection. By implementing a data transformation pipeline built on DBT, Airflow, and AWS Fargate, Thomson Reuters has streamlined its data analytics process, minimizing dependencies on data engineers and accelerating insight generation.