Airflow

To Manage Data Pipelines

Presented by Mahendra Yadav

Azri Solutions

Mahendra Yadav

Data Engineer at Azri Solutions

Introduction to Airflow

Airflow is a platform to programmatically author, schedule and monitor data pipelines.

Data Pipeline

Existing challenges

Challenges with cron:
  • Hard to debug and maintain
  • No execution statistics
  • No centralized logging
  • No automatic retries

Main Advantages

Stats

Logs

Handle task failures

Alert on failures

Monitoring of tasks over time

Pipelines written in Python
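
The original slides showed the pipeline code as screenshots. As a stand-in, here is a minimal sketch of a pipeline defined in Python (Airflow 1.x-style imports; the DAG and task names are made up for illustration):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # One DAG with a single task that prints the current date.
    dag = DAG(
        dag_id="hello_pipeline",            # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    print_date = BashOperator(
        task_id="print_date",
        bash_command="date",
        dag=dag,
    )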

Basic Concepts

Workflow as DAG

  • A DAG (Directed Acyclic Graph) is a collection of all the tasks to be run
  • It reflects the relationships and dependencies between the tasks
  • Adding a new DAG is easy: drop a new Python file into the DAGs folder (see the sketch below)
[screenshot: DAGs view in the Airflow UI]
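
A minimal sketch of how a DAG's shape is declared in code (task names are hypothetical; DummyOperator is a placeholder task that does nothing):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG("fanout_example", start_date=datetime(2018, 1, 1))

    t1 = DummyOperator(task_id="t1", dag=dag)
    t2 = DummyOperator(task_id="t2", dag=dag)
    t3 = DummyOperator(task_id="t3", dag=dag)

    # Edges of the graph: t2 and t3 both depend on t1,
    # so they may run in parallel once t1 succeeds.
    t1 >> [t2, t3]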

Tasks as Operators

  • BashOperator
  • PythonOperator
  • PostgresOperator
  • MySqlOperator
  • ExternalTaskSensor
  • EmailOperator
  • SlackOperator
  • And many more...
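
Each operator describes one unit of work. As an illustration (task names and commands are hypothetical), two of the operators above used together:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("operators_example", start_date=datetime(2018, 1, 1))

    def greet():
        print("hello from a Python callable")

    run_shell = BashOperator(task_id="run_shell", bash_command="echo hi", dag=dag)
    run_python = PythonOperator(task_id="run_python", python_callable=greet, dag=dag)

    run_shell >> run_python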

Scheduling the DAGs

DAGs can be scheduled to run at a given frequency by setting the schedule_interval parameter, which accepts either of the following (sketched below):

  • A cron expression
  • A Python datetime.timedelta object
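
A sketch of both forms (DAG names are made up):

    from datetime import datetime, timedelta

    from airflow import DAG

    # Cron expression: run every day at 06:00.
    daily = DAG("daily_report", start_date=datetime(2018, 1, 1),
                schedule_interval="0 6 * * *")

    # timedelta: run every 4 hours.
    frequent = DAG("frequent_sync", start_date=datetime(2018, 1, 1),
                   schedule_interval=timedelta(hours=4))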

Executors

The executor is responsible for running the tasks.

  • SequentialExecutor (default): runs one task at a time
  • LocalExecutor: runs tasks in parallel on a single machine
  • CeleryExecutor: distributes tasks across a pool of workers
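
The executor is chosen in airflow.cfg; for example, to switch from the default to the LocalExecutor (sketch):

    [core]
    executor = LocalExecutor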

DAG Anatomy

Default Arguments

Instantiation

Tasks
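
Putting the three pieces together, a minimal sketch of a complete DAG file (names and values are illustrative; Airflow 1.x-style imports):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # 1. Default arguments, applied to every task in the DAG.
    default_args = {
        "owner": "airflow",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "email_on_failure": True,
        "email": ["alerts@example.com"],    # hypothetical address
    }

    # 2. Instantiation of the DAG itself.
    dag = DAG(
        dag_id="anatomy_example",
        default_args=default_args,
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    # 3. Tasks, tied to the DAG and to each other.
    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)
    extract >> load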

Airflow in practice - case study

ETL Pipelines

Download Data → Process → Load → Send Report
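
A sketch of how this four-step pipeline could be wired up (commands, names, and the email address are placeholders, not the code shown in the demo):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.email_operator import EmailOperator

    dag = DAG("etl_case_study", start_date=datetime(2018, 1, 1),
              schedule_interval="@daily")

    download = BashOperator(task_id="download_data",
                            bash_command="echo downloading", dag=dag)
    process = BashOperator(task_id="process",
                           bash_command="echo processing", dag=dag)
    load = BashOperator(task_id="load",
                        bash_command="echo loading", dag=dag)
    report = EmailOperator(task_id="send_report",
                           to="team@example.com",
                           subject="Pipeline report",
                           html_content="Run finished.",
                           dag=dag)

    download >> process >> load >> report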

Demo

Email

Who uses Airflow
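
Airflow was originally developed at Airbnb and is now an Apache Software Foundation project, with a long list of companies using it in production.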

Resources
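
  • Documentation: https://airflow.apache.org/
  • Source code: https://github.com/apache/airflow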

Thank You!

Mahendra Yadav