Designing an ELT Data pipeline with Airflow and AWS Redshift

Sebastien Sime
6 min read · Jan 25, 2021

There are hundreds of ways to create data pipelines across any type of organisation. Well-built pipelines allow a business to remain agile by keeping data circulating and available whenever it is needed to make the right decisions.

Beyond keeping data flowing, monitoring the data workflow at each step is another strategic concern that needs to be addressed. Besides, in the big data era we rarely store data or work directly on our PCs or flash drives anymore. Cloud infrastructure is taking over with services such as Google BigQuery or Amazon Redshift. So, with these kinds of infrastructures and cloud services, how can we design and monitor a data pipeline using Apache Airflow and Amazon Redshift? The implicit question being: why Airflow and Redshift?

Since the beginning of my journey into the Data Engineering field, I have been wondering about the real difference between ELT (Extract, Load and Transform) and ETL (Extract, Transform and Load) data pipelines. What is really the difference?

These are the questions I will be answering (plus more) by the end of this story. To do that, I will use a recent project of mine that consisted of designing a data pipeline for a music application.

Buzzwords: ETL / ELT, Data Warehouse.

Before going on, and to keep it short and to the point: all the buzzwords above are well defined in The Analytics Setup Guidebook written by the team at holistics.io.


A Data Warehouse can be seen as a centralized place where data is stored, processed and made available for business analytics. On Redshift, it consists of relational databases with special features such as a columnar storage engine, compression of columnar data and parallelization of query execution.

Business context: why Airflow

I have been working with a music app development team interested in processing the data generated by the application. The project has already been through several steps:

  1. The creation, at the early development stages, of a relational schema to store the generated data in a Postgres database.
  2. The design of an ETL data pipeline in Python to process the data and store it in the previously created schema.
  3. Later, as the data volume increased, the creation of a data warehouse with its associated ELT process. At this step, we worked with both AWS Redshift and the Anaconda Python distribution.

Long story short, at this fourth step we are interested not only in processing the data but also in literally viewing the status of all the processing steps in a unified environment. That is where Airflow comes in handy.

The Airflow web interface makes it very easy to follow the progress of the processing and to check the status of each task. However, the pipeline has to be designed with respect to Airflow's operators and DAG formalism, with a few key points to keep in mind (a minimal DAG sketch follows the list below):

  1. The pipeline has to be designed like a project in project management (divided into tasks first). Such a succession of tasks is called a DAG (Directed Acyclic Graph).
  2. Each task should ideally be designed to perform one action and one action only, and each task should be associated with a specific function or operator. It is better to use the operators Airflow already provides, as they are already optimized, but custom operators can be created for specific actions.
  3. The tasks should be logically ordered and partitioned to achieve the desired result.
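
To make this formalism concrete, here is a minimal sketch of what a DAG definition looks like with the Airflow 1.10.x API. The DAG name, schedule and placeholder callable are illustrative assumptions, not the project's actual code.

```python
# Minimal DAG sketch (Airflow 1.10.x style); names are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def extract_data(**kwargs):
    """Placeholder callable: one task should do one thing only."""
    print("extracting data...")


default_args = {
    "owner": "sparkify",
    "start_date": datetime(2021, 1, 1),
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "music_app_elt",
    default_args=default_args,
    description="ELT pipeline for the music application",
    schedule_interval="@hourly",
)

start = DummyOperator(task_id="begin_execution", dag=dag)
extract = PythonOperator(
    task_id="extract_data",
    python_callable=extract_data,
    provide_context=True,
    dag=dag,
)
end = DummyOperator(task_id="stop_execution", dag=dag)

# The dependencies define the directed acyclic graph: start -> extract -> end
start >> extract >> end
```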

The business case.

With a simple Python script, we cannot properly monitor the process, so let's remember that the objectives are:

  • To create a Data pipeline to process the data of the music application.
  • To monitor each step or task so as to constantly have a view of how the process is evolving.

The Data Engineering solution.

The music application manages songs and generates user activity data. The song metadata as well as the user logs are stored in an AWS S3 bucket. The process should then:

  • Retrieve the data from the S3 bucket.
  • Store the data into AWS Redshift staging tables (a sketch of this step follows the list).
  • Fill a Kimball-style dimensional data model (a relational schema optimized for aggregated queries) consisting of one fact table and four dimension tables (see the picture below).
Designed Kimball’s dimensional data model (Star schema) with staging table
  • Perform data quality checks to guarantee the quality of the loaded data.
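
As a rough sketch of the staging step, the snippet below copies JSON files from S3 into a Redshift staging table through Airflow's PostgresHook (Redshift speaks the Postgres protocol) and AwsHook. The connection ids ("aws_credentials", "redshift"), the table name and the S3 path are illustrative assumptions; the actual code lives in the repository linked further down.

```python
# Hedged sketch of the S3 -> Redshift staging step (Airflow 1.10.x hooks).
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook

COPY_SQL = """
    COPY {table}
    FROM '{s3_path}'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    FORMAT AS JSON 'auto';
"""


def stage_to_redshift(table, s3_path, **kwargs):
    """Run a Redshift COPY from S3 into the given staging table."""
    aws = AwsHook("aws_credentials")            # assumed AWS connection id
    credentials = aws.get_credentials()
    redshift = PostgresHook(postgres_conn_id="redshift")  # assumed conn id
    redshift.run(COPY_SQL.format(
        table=table,
        s3_path=s3_path,
        access_key=credentials.access_key,
        secret_key=credentials.secret_key,
    ))


# Example usage inside a PythonOperator callable (bucket name is hypothetical):
# stage_to_redshift("staging_events", "s3://sparkify-data/log_data")
```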

To meet each of these requirements, I created custom functions (operators) in Airflow to:

  1. Connect to AWS and transfer data from S3 to Redshift.
  2. Load the fact table.
  3. Load the dimension tables.
  4. Check that all records are not null (a sketch of such an operator follows the list).
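
Below is a minimal sketch of how such a quality check can be written as a custom operator with the Airflow 1.10.x API. The simple row-count check and the "redshift" connection id are assumptions for illustration; the checks in the actual project may differ.

```python
# Sketch of a custom data-quality operator (Airflow 1.10.x API).
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fail the task if a checked table returns no rows."""

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} is empty".format(table))
            self.log.info("Data quality check passed: %s has %s rows", table, records[0][0])
```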

Each of these functions is associated with a task, and the tasks are ordered as follows:

Airflow 1.10.14 DAG with Python 3.6 on Windows 10 WSL

As you can see in the picture above, some tasks are performed in parallel (to load the staging tables and the dimension tables). This means the associated tasks have been logically partitioned to speed up the process.
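
As a sketch, this is how the dependencies can be declared so that the staging tasks and the dimension loads actually run in parallel; the task variable names here are illustrative, not the project's exact ones.

```python
# Dependency wiring sketch: lists on one side of >> create parallel branches.
start_operator >> [stage_events, stage_songs]
[stage_events, stage_songs] >> load_songplays_fact
load_songplays_fact >> [load_users_dim, load_songs_dim, load_artists_dim, load_time_dim]
[load_users_dim, load_songs_dim, load_artists_dim, load_time_dim] >> run_quality_checks
run_quality_checks >> end_operator
```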

Airflow records logs for each task, so the execution status can be assessed separately. With all tasks green, it means that more than a hundred thousand records have been processed and checked.

Tips when coding in Airflow.

If you are interested in the code that yields the above result, you can check my GitHub repository: ssime-git/Sparkify_Airflow_ELT_Pipeline (ELT pipeline automation with Airflow, github.com).

When coding in Airflow, you want to work with several things at the same time:

  • The web interface: to check, after refreshing, that the DAG version is correct.
  • Your favourite Python editor: to edit and save the code.
  • The running scheduler (in a WSL window): it can display more detailed errors.
  • And the running webserver (in another WSL window): to display more detailed errors, since the web interface doesn't show the full source of an error.

Make sure to refresh the DAG in Airflow after any modification of the Python code.

Conclusion.

Working with Airflow makes it very easy to monitor the data pipeline and to check both the global status of the project and the individual status of each task.

However, transitioning from ETL / ELT Python scripts to Airflow can take some time, since one needs to understand the Airflow formalism and processing techniques.

I hope you find this helpful in bringing your own project to life.

References.

Introduction | The Analytics Setup Guidebook (holistics.io)

Apache Airflow
