Leveraging Airflow To Build Scalable and Reliable Data Platforms at 99acres.com with Samyak Jain

The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI - A podcast by Astronomer - Thursdays

Data orchestration is evolving rapidly, with dynamic workflows becoming the cornerstone of modern data engineering. In this episode, we are joined by Samyak Jain, Senior Software Engineer - Big Data at 99acres.com. Samyak shares insights from his journey with Apache Airflow, exploring how his team built a self-service platform that enables non-technical teams to launch data pipelines and marketing campaigns seamlessly.

Key Takeaways:

(02:02) Starting a career in data engineering by troubleshooting Airflow pipelines.
(04:27) Building self-service portals with Airflow as the backend engine.
(05:34) Utilizing API endpoints to trigger dynamic DAGs with parameterized templates.
(09:31) Managing a dynamic environment with over 1,400 active DAGs.
(11:14) Implementing fault tolerance by segmenting data workflows into distinct layers.
(14:15) Tracking and optimizing query costs in AWS Athena to save $7K monthly.
(16:22) Automating cost monitoring with real-time alerts for high-cost queries.
(17:15) Streamlining Airflow metadata cleanup to prevent performance bottlenecks.
(21:30) Efficiently handling one-time and recurring marketing campaigns using Airflow.
(24:18) Advocating for Airflow features that improve resource management and ownership tracking.

Resources Mentioned:

Samyak Jain - https://www.linkedin.com/in/samyak-jain-ab5830169/
99acres.com - https://www.linkedin.com/company/99acres/
Apache Airflow - https://airflow.apache.org/
AWS Athena - https://aws.amazon.com/athena/
Kafka - https://kafka.apache.org/

Thanks for listening to “The Data Flowcast: Mastering Airflow for Data Engineering & AI.” If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

#AI #Automation #Airflow #MachineLearning