Trends in the data industry, how lakeFS fits into the new data stack, and a personal story of why I chose to join lakeFS.
Open Source Is Everywhere!
For the past decade, open source has made its home in the ever-evolving data stack zoo known as the Hadoop ecosystem. To keep everything in order, ZooKeeper was tasked with coordinating the zoo's resources, and all the animals were happy.
Until one day, new creatures arrived at the zoo: Spark (a future star), Kafka (a strange new friend), and others that started joining the mix. Together with YARN, they preferred not to mix animals with one another and kept resources in strict, static partitions, where they weren't always fully utilized.
Over time, the zoo's resources stayed the same, but the demand to support new creatures grew tremendously, and ZooKeeper found himself in a party of adventurers, making the journey to the cloud…
The Hadoop ecosystem led the data stack with an open-source-first approach. It is no wonder that today the new data stack is, guess what? Also open.
Why It Matters
Open source allows us to experiment with tools, assess their relevance to our tech stack, and exchange knowledge with fellow practitioners. This is true for technologies as well as for file formats, like Apache Parquet, Avro, ORC, and many others. By virtue of being open source, they are widely used and well-integrated with the rest of the stack.
The common thread is open formats! This approach enables the community to build connectors and integrations that go beyond the immediate use case, supporting both adoption of the technology and innovation on top of it.
Open source is a practice as well as a culture.
Changes in the New Data Stack
The Hadoop stack has progressively moved to the cloud, with concepts largely borrowed from the traditional SQL/database worlds. This movement introduced scalable solutions that aim to enable data computation over large, distributed datasets. Some refer to it as the puzzle of database features.
On Increasing Data Complexity
Our industry is at a point where the size of data is continually growing. The forces of digitization heavily impact how much data we process, capture, and exploit to drive innovation and value.
The challenges of distributed systems required to manage this data cannot be overstated. All of the functionalities of observing, cataloging, governing, and storing data while juggling the building blocks are tough! We need better ways to manage the data chaos!
If you’re familiar with the concept of chaos engineering, it centers on continuously running automated experiments in production to help ensure systems are designed to minimize the blast radius of potential issues.
We need the same idea for data. We need systems that allow us to manage chaotic data spaces while allowing experimentation. Sound complex? That’s because it is.
But hey, there’s good news: there are practices forged over decades of developing software that can be borrowed, repurposed, and applied to the discipline of data. These include:
- Automated, testing-based deployment
- Source control management (e.g., Git)
- Proper logging and observability into processes
Taken together, they form important pieces of what’s called Application Lifecycle Management, or ALM. Not everything that’s important to software development has a parallel in data. But, much of it does and we’ve still got plenty of areas where we can catch up to our counterparts in software.
Where lakeFS Fits in This Picture
So far we discussed the complexity of data systems, why open source is important, and the need for better tools to manage the lifecycle of data.
This is what lakeFS solves for. It is an open-source project that enables git for data, testing-based deployments, and more. I’ll go into more details on how it does this in the future, but for now I want to focus more on what joining lakeFS means to me personally.
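To make "git for data" a little more concrete, here is a rough sketch of what that workflow can look like with the lakeFS `lakectl` CLI. The repository and branch names (`my-repo`, `experiment`, `main`) are placeholders for illustration, and this assumes a running lakeFS installation:

```
# Create an isolated branch of the data lake to experiment on
lakectl branch create lakefs://my-repo/experiment --source lakefs://my-repo/main

# ... run pipelines and write data against the experiment branch ...

# Commit the changes, creating a reproducible snapshot
lakectl commit lakefs://my-repo/experiment -m "backfill events table"

# Once validation passes, merge the branch back into main
lakectl merge lakefs://my-repo/experiment lakefs://my-repo/main
```

The key idea is that production data on `main` is untouched until the experiment is validated and merged, mirroring the branch-test-merge flow familiar from software development.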
You spend a lot of time at work, so career choices, even though they’re only one aspect of your life, can be a big deal. Taking on a new position within a small yet courageous startup implies that you have a deep connection to its vision, mission, and culture. You truly want the company to succeed, and that gives you energy and a sense of purpose.
Wait, what do you mean by that? During my time at Microsoft, I got exposed to this beautiful Ikigai diagram:
It’s a Japanese concept that speaks about “a motivating force; something or someone that gives a person a sense of purpose or a reason for living”. Generally speaking, it may refer to joy or a sense of fulfillment. To break it down, each one of us has things that:
- We love doing.
- We’re good at.
- Make the world better.
- We can be paid for.
How are lakeFS and Ikigai related?
Allow me to share my personal experience to help you understand how joining lakeFS relates to Ikigai values and the company culture.
1. What do you love?
I’m an engineer. I get excited diving into open source technologies, learning about the features from the actual code, and discovering their design patterns by reverse engineering their software architecture. Together with my colleagues, I study and research new architectures that eliminate friction for Data & AI practitioners and later share the findings through educational content such as books, presentations, step-by-step tutorials, and on @adipolak 🐦.
This relates both to our mission and to our culture:
“We value agility and continuously improve”
As a young company, our ability to learn, iterate and practice a growth mindset as a team is one of our core strengths.
Another aspect is community. For many years I volunteered in, led, and recently built the learning-together DataEngineers community. A community, for me, is a safe space where people come together to overcome obstacles, get empowered, innovate, and learn from each other’s experiences.
It aligns well with other culture values:
“We are part of the community…”
We volunteer with and support communities around us during work hours.
2. What are you good at?
I’m not the best, but I work with great people, and my role as a lead is to enable a balanced environment for team excellence. Two of our company culture values speak about creating a safe space for learning and growing:
“We are a Team: Authentic, collaborative, show solidarity and trust“
We win together and we fail together. When you are part of a team that lives this culture, you are able to grow professionally, which leads to the next value:
“We practice excellence.”
We support professional and personal growth. This is what practicing excellence means to us.
3. What does the world need?
From a tech point of view, lakeFS enables us to adopt continuous principles for data. It’s going to make the world a better place for Data & AI practitioners who hopefully will be able to sleep better at night and worry less about precious production data.
4. What can you be paid for?
Money shouldn’t be the only reason to accept a job offer. Fortunately, the skills of Data & AI practitioners are very much in demand and make this requirement of Ikigai not too difficult to fulfill.
Finally, what should you expect from me?
I’ve taken a new position at Treeverse as VP of Developer Experience! I am excited about this change and will continue sharing my learnings with you all!
My responsibilities are directly connected to the company vision and mission, building an organization that enables better, more holistic open data stack technologies, and by doing so, crafting practices that will make engineers’ lives more enjoyable.
If this sounds exciting to you, maybe you’ll even consider joining me on this journey. Open positions for my team at lakeFS can be found here 🙂
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.