Oz Katz
September 14, 2020

Day 1: Finally, an easy one

Got a pretty simple task for a change – read a new type of event stream generated by sales, and publish it to the data lake. Sounds like a straightforward ETL. I estimate this as one day of work. I can reuse a bunch of code from similar jobs and maybe finish even earlier.

This input data is being spilled from Kafka to S3, as Parquet files, which is nice, but oh boy – they decided to include every possible bit of information here – 350 different columns.

I need to understand both the schema and the properties – volume, column ranges, possible values and their meaning, and decide which fields I want to extract from this.
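A first pass at those properties can be a small profiling loop. The sketch below is plain Python over a few sample rows already loaded as dicts. The column names (`user_id`, `amount`) are made up for illustration, and for 350 real columns I would more likely run something equivalent in Spark or Athena.

```python
from collections import defaultdict


def profile(rows):
    """Return per-column stats: null count, distinct count, min and max."""
    # Collect every column name seen in any row, so missing keys count as nulls.
    columns = set()
    for row in rows:
        columns.update(row)

    values = defaultdict(list)
    for row in rows:
        for col in sorted(columns):
            values[col].append(row.get(col))

    stats = {}
    for col, vals in values.items():
        present = [v for v in vals if v is not None]
        stats[col] = {
            "nulls": len(vals) - len(present),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return stats


# Hypothetical sample rows; real data would come from the Parquet files.
sample = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": None},
    {"user_id": "u1", "amount": 25.5},
]
report = profile(sample)
```

Here `report["amount"]` shows one null and a range of 10.0 to 25.5, which is the kind of thing I want to know before deciding what to keep.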

I also need to understand the SLA for this source – how often does it update? How late can it arrive? Is it only ever appended to – or will history change?

Another important aspect – how does it relate to other data in the org? Can I join it using the columns I have? Do I need a lookup table? Does a “user_id” mean the same thing here as it does in other collections in the organization?

Day 2: Let’s write a Spark Job

Once I have all these sorted out, it’s time to do something useful with this data. Let’s start by aggregating it, pruning out the uninteresting columns, and partitioning it by date. Not sure which of the six columns that look like a date I should be using. 

Guess I’ll need to sit down with the collection team a third time.

Now that I know what transformation I want to write, I remembered I have a Spark job that does something similar – I’ll modify it for this collection. 

Spark is so expressive – it’s only 6 lines of code!

Well, 6 lines of business logic surrounded by 120 lines of wiring and boilerplate. Still nice though.
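For the record, the shape of that business logic is roughly the following. This is a plain-Python stand-in rather than the actual Spark job, so it runs anywhere, and the column names (`event_time`, `sku`, `amount`) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

KEEP = ("event_time", "sku", "amount")  # hypothetical columns worth keeping


def daily_totals(events):
    """Prune each event to KEEP, then sum amount per sku within each date."""
    partitions = defaultdict(lambda: defaultdict(float))
    for event in events:
        # Prune: drop every column we did not explicitly ask for.
        pruned = {col: event[col] for col in KEEP}
        # Partition key: the calendar date of the chosen timestamp column.
        day = datetime.fromisoformat(pruned["event_time"]).date().isoformat()
        # Aggregate: running total per sku inside that date.
        partitions[day][pruned["sku"]] += pruned["amount"]
    return {day: dict(totals) for day, totals in partitions.items()}


events = [
    {"event_time": "2020-09-13T23:59:00", "sku": "a", "amount": 5.0, "noise": 1},
    {"event_time": "2020-09-14T00:01:00", "sku": "a", "amount": 2.0, "noise": 2},
    {"event_time": "2020-09-14T08:30:00", "sku": "a", "amount": 3.0, "noise": 3},
]
result = daily_totals(events)
```

The two events on either side of midnight land in different partitions, which is exactly why it matters which of the date-looking columns gets used.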

Days 3-5: From sample data to real, somewhat big data

Running on a small sample looks OK, so let’s run it on all of yesterday’s events!

Spark basically died. I’ll now spend the next day digging through logs: Spark executors, driver, YARN, stdout and stderr. Found a weird JVM error – Stack Overflow says I need to add some property to an XML file. Not sure I understand which one – all the examples I found are for much older Hadoop versions. After 2 days of brute-forcing this setting into every possible XML file, the job finally completed successfully!

Days 6-11: Let’s write a table

So now, I need to think about how my output will be consumed: how often do I run it? Where do I keep its metadata? Do I provide an SLA? What guarantees do I give about the quality and validity of this dataset? Also, how do I keep it in sync with other related data? How long do I retain it for?

Days 12-17: Going to Production

Ok, so now we can deploy this to production!

I will write an Airflow DAG to orchestrate a daily job. When should it start? How do I know a full day is available for processing?

I’ll wait long enough and assume it’ll be fine. DAG deployed, scheduled for 3AM daily.
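A less hopeful option would be to check explicitly that the whole day has landed before starting. The sketch below assumes the upstream writer produces one prefix per hour, named like `dt=2020-09-13/hour=00`. That layout is an assumption of mine, not something the source guarantees, and the prefix listing is passed in rather than fetched from S3.

```python
from datetime import date


def missing_hours(day, existing_prefixes):
    """Return the hours (0-23) of `day` that have no prefix yet."""
    # Assumed layout: dt=YYYY-MM-DD/hour=HH, one prefix per hour of the day.
    expected = {f"dt={day.isoformat()}/hour={h:02d}": h for h in range(24)}
    return sorted(
        hour for prefix, hour in expected.items() if prefix not in existing_prefixes
    )


# Hypothetical listing: every hour has arrived except the last one.
listed = {f"dt=2020-09-13/hour={h:02d}" for h in range(23)}
gaps = missing_hours(date(2020, 9, 13), listed)
```

An empty result would mean the day is complete. Something like this could sit behind a sensor-style task in the DAG instead of a fixed start time, though late-arriving files inside an existing hour would still slip through.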

Maybe I’ll try to spin up a dedicated cluster on EMR? 

4 days and 7 Zoom meetings later, I have pretty clear requirements. I think.

Last time I tried that, we ran out of IP addresses in the subnet. The time before that we hit the quota for the instance type I chose, so it couldn’t start running.

Let’s try just running it at 3:17 instead..? 🤷‍♂️

Next morning I see that our new collection is live! Airflow is green! Task is DONE.

Took only 6 lines of code, 17 Zoom meetings and 3 weeks to complete 🤦‍♀️

Diary of a Data Engineer

Day 21: Oh no, I forgot about monitoring.

A few days later – I’m thinking, maybe it’d be a good idea to do a quick count(*) in Athena to make sure it still looks OK.

Damn. Yesterday’s partition is only about 10% the size of previous days. I better reply-all on my go-live email and tell everyone. Not even sure who else is using this. Seems like it’s gonna be one of THOSE days again…
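What I should have had from the start is an automated version of that manual count. A minimal sketch, assuming the per-partition row counts have already been fetched (for example by that same count(*) query) into a dict keyed by ISO date. The function name, the 50% threshold, the 7-day lookback and the dates are all arbitrary choices for illustration.

```python
from statistics import mean


def looks_short(counts_by_day, day, threshold=0.5, lookback=7):
    """True if `day` has fewer rows than `threshold` times the recent average."""
    # ISO dates sort lexicographically, so string comparison orders them correctly.
    previous = sorted(d for d in counts_by_day if d < day)[-lookback:]
    if not previous:
        return False  # nothing to compare against yet
    baseline = mean(counts_by_day[d] for d in previous)
    return counts_by_day[day] < threshold * baseline


# Hypothetical counts: the last partition is about 10% of the usual size.
counts = {
    "2020-10-01": 1000,
    "2020-10-02": 1100,
    "2020-10-03": 900,
    "2020-10-04": 100,
}
alert = looks_short(counts, "2020-10-04")
```

With these numbers the last day is flagged and the earlier ones are not. Running this as a final task in the DAG would at least mean I hear about it before everyone on the go-live email does.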

Coming in the next day, I see my simple Spark job has been running for 8 hours already, barely making any progress. Apparently using a shared Spark cluster wasn’t the best idea – there are 12 other jobs scheduled to start at 3AM as well. Guess everyone else also thought 3 hours after midnight was a safe bet for daily data.

Oz Katz is the co-creator of lakeFS, an open source platform that delivers resilience and manageability to object-storage based data lakes. Check out the project’s GitHub repository to learn more.
