This blog post is based purely on my own experience, so it could be somewhat subjective. The main reason I took the program was to consolidate and polish my knowledge in the data engineering field.
The Data Engineering Nanodegree program has five core pillars: Data Modeling, Cloud Data Warehouses, Data Lakes with Spark, Data Pipelines with Airflow, and the Capstone Project. The content below covers each of these pillars in turn, and at the end I share my overall view of the program.
Data Modeling
This module mainly introduces the concept of data modeling in two major parts — Relational Databases (PostgreSQL) and NoSQL Databases (Cassandra).
For the Relational Database part, it covers OLAP vs. OLTP, normalization vs. denormalization, and fact and dimension tables. Each topic comes with corresponding exercises in Python and PostgreSQL. The content and exercises are fairly easy to understand, and some of them actually make a good refresher, since I don't always get chances to do schema design in my daily work. One suggestion I'd make is that the course team could provide a clearer, more complete data catalog for the homework project.
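To make the fact/dimension idea concrete, here is a minimal star-schema sketch in PostgreSQL DDL held in Python strings. The table and column names are my own hypothetical example, not the course's exact project:

```python
# Minimal star-schema sketch (hypothetical names, not the course's
# exact project). Dimension tables hold descriptive attributes;
# the fact table records events and references the dimensions.

dim_users = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    level      TEXT
);
"""

fact_songplays = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT REFERENCES users (user_id),  -- link to dimension
    song_id     TEXT,
    session_id  INT,
    location    TEXT
);
"""
```

The denormalized variant the course also discusses would simply fold the user attributes straight into the fact table, trading storage and update anomalies for fewer joins.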
For the NoSQL part, it covers the CAP theorem in general, and then all of the remaining content is about Cassandra and only Cassandra. Credit should go to the course review team: the lecturer made some unintentional slips while speaking, and every one of them was caught and corrected by the review team.
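The central lesson of the Cassandra part is query-first modeling: each table is designed around one specific query. A hedged sketch of that idea, using a hypothetical table of my own naming:

```python
# Query-first modeling sketch for Cassandra (hypothetical example).
# The partition key (session_id) decides where rows are stored; the
# clustering column (item_in_session) orders rows within a partition.

# Query to support: "songs played in a given session, in play order".
create_session_songs = """
CREATE TABLE IF NOT EXISTS session_songs (
    session_id      INT,
    item_in_session INT,
    artist          TEXT,
    song_title      TEXT,
    PRIMARY KEY ((session_id), item_in_session)
);
"""

# The WHERE clause must hit the partition key; Cassandra does not
# allow ad-hoc filtering the way a relational database does.
select_session_songs = (
    "SELECT artist, song_title FROM session_songs WHERE session_id = %s;"
)
```

If a second query needed the data by, say, user instead of session, the Cassandra way is a second table with a different partition key, not a secondary index by default.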
Cloud Data Warehouses
The cloud data warehouses module contains three pillars of content — Introduction to Data Warehouses, Introduction to Cloud Computing and AWS, and Implementing Data Warehouses on AWS.
The first part introduces the classic, old-school picture of the data warehouse, covering several traditional architectures such as the Business Process Architecture (BUS), Independent Data Marts, the Corporate Information Factory (CIF), and the Hybrid Bus and CIF Architecture. Even though not all of these are widely used nowadays, it's still nice to get a broad view of the industry's history.
The remaining two parts are about AWS and how to implement a data warehouse on AWS Redshift. If you know Python and have a basic grasp of automation and Infrastructure as Code (IaC), I'd suggest walking directly through the exercise notebooks to study the context and get familiar with the syntax and workflows — the demo videos move a little fast when showing the code and explaining it at the same time.
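For flavor, the IaC pattern in those notebooks boils down to describing the cluster as data and handing it to the AWS SDK. A sketch with placeholder values (the real exercises read these from a config file; boto3 is only referenced in a comment here, not imported):

```python
# IaC-style sketch: the cluster is described as plain data first.
# All values below are placeholders, not real infrastructure.

cluster_config = {
    "ClusterIdentifier": "dwh-cluster",   # hypothetical name
    "ClusterType": "multi-node",
    "NodeType": "dc2.large",
    "NumberOfNodes": 4,
    "DBName": "dwh",
    "MasterUsername": "dwh_user",
    "MasterUserPassword": "<secret-from-config>",  # never hard-code this
}

# With boto3, creating the cluster is then roughly:
#   redshift = boto3.client("redshift", region_name="us-west-2")
#   redshift.create_cluster(**cluster_config, IamRoles=[role_arn])
# and tearing it down again is a single delete_cluster call — which is
# the point: infrastructure you can create and destroy from code.
```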
The project for this module is about building a data warehouse on AWS Redshift. The project context is basically the same as in the first project, so it feels like doing a similar task with a new workflow and framework. Schema design is still the core of this project, since you need to figure out how to choose the DISTKEY and SORTKEY(s) for the staging tables, the dimension tables, and the fact table.
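As a hedged illustration of those choices (a hypothetical schema, not my actual project DDL): a large fact table is typically distributed on its busiest join key and sorted on the common filter column, while a small dimension can be copied to every node:

```python
# Redshift table-design sketch (hypothetical schema). DISTKEY
# co-locates rows sharing the same key on one node, so joins on that
# key avoid network shuffles; SORTKEY speeds up range filters.

fact_songplays = """
CREATE TABLE songplays (
    songplay_id BIGINT IDENTITY(0, 1),
    start_time  TIMESTAMP SORTKEY,   -- frequent time-range filters
    user_id     INT DISTKEY,         -- large join key: distribute on it
    song_id     VARCHAR(50),
    session_id  INT
);
"""

# Small dimension tables are often given DISTSTYLE ALL instead, so
# every node holds a full copy and joins against them never shuffle:
dim_time = """
CREATE TABLE time (
    start_time TIMESTAMP SORTKEY,
    hour INT, day INT, week INT, month INT, year INT, weekday INT
) DISTSTYLE ALL;
"""
```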
Data Lakes with Spark
This module dives deeper into the AWS ecosystem (especially EMR) and introduces Spark (via PySpark). It covers Spark well, walking through its use cases and syntax, and provides practical techniques for handling data skew when working with large data volumes.
However, some of the course content needs updating: AWS has iterated on its web console a couple of times since the material was recorded, which matters a lot for demos aimed at people who have barely touched AWS. For example, following the course materials alone, I could not set up an EMR cluster and attach an EMR notebook to it with the command the material provides. I also noticed that one of the demo videos was out of sync with the corresponding course content.
The project in this module shares the same context as the ones in the modules above, but it requires leveraging EMR and PySpark to process data from S3 and write the fact and dimension tables back to S3 — in other words, this project treats S3 as a data lake.
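To show what "S3 as a data lake" ends up looking like on disk, here is a toy sketch in plain Python (not Spark) of the key=value folder layout that Spark's `partitionBy` produces when writing Parquet; the bucket and column names are made up:

```python
# Toy illustration (plain Python, NOT Spark) of the folder layout a
# partitioned Parquet write produces. In the project the real call is
# roughly:
#   df.write.partitionBy("year", "month").parquet("s3a://bucket/songplays/")

def partition_path(base, record, keys):
    """Build the key=value/... path Spark would write this record under."""
    parts = [f"{k}={record[k]}" for k in keys]
    return "/".join([base.rstrip("/")] + parts)

row = {"songplay_id": 1, "year": 2018, "month": 11, "user_id": 7}
print(partition_path("s3a://my-bucket/songplays", row, ["year", "month"]))
# -> s3a://my-bucket/songplays/year=2018/month=11
```

Choosing the partition columns well matters: readers that filter on `year` and `month` can then skip whole folders instead of scanning the entire table.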
Data Pipelines with Airflow
To be honest, this module is the best of the bunch. The lecturer (let me shout out to Benjamin Goldberg here ;p) prepared very detailed materials about Apache Airflow. I've attended Airflow workshops on other platforms such as O'Reilly, and this one clearly covers a broader scope of practical use cases and contains more production-level tips. The exercise platform for this module is also solid and user-friendly.
My organization does not use Apache Airflow; we have been running a heavily customized internal scheduling and data orchestration framework in production for a long time, and it still serves our daily business operations efficiently. Still, I've been interested in this space, so this module benefited me a lot, and the exercises let me write some hands-on code to better understand this trendy tool.
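Airflow's core idea is that a pipeline is a DAG of tasks executed in dependency order. As a toy illustration (standard-library Python, not Airflow itself; the task names are hypothetical), the same ordering can be computed with a topological sort:

```python
# Toy sketch of what an Airflow DAG encodes: tasks plus dependencies,
# run in topological order. A real Airflow DAG file would express the
# same pipeline roughly as:
#   create_tables >> load_staging >> load_fact >> run_quality_checks

from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
deps = {
    "load_staging": {"create_tables"},
    "load_fact": {"load_staging"},
    "run_quality_checks": {"load_fact"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
# -> ['create_tables', 'load_staging', 'load_fact', 'run_quality_checks']
```

Airflow layers scheduling, retries, backfills, and monitoring on top of this ordering, which is where most of the production-level tips in the module come in.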
Capstone Project
The capstone project is quite open-ended. Generally, it offers two options — one is working with the materials and data provided by Udacity, the other is bringing your own third-party dataset.
I chose to go with the dataset Udacity provided, and the main topic was building ETL pipelines for I94 immigration records — as a foreigner, it looked interesting to me, since I always remember joining the long queue at customs every time lol. For more details, please review this repo.
Should I take it?
After reading all of the above, you might want to ask yourself — "Should I take it as well? Is it worth it?" To be honest, I don't know how to answer that; it really is case by case. The Nanodegree is generally good and professionally made, but the regular price is not that affordable. The good news is that Udacity periodically releases coupons and discounts, and I had been keeping tabs on them since the pandemic. In the end, I took the program at a fairly low price on the monthly payment plan and finished it within the first payment cycle, so considering the knowledge gained versus the cost, I still think it's very cost-effective — as with many other things, you've got to have a certain amount of self-discipline.