Brooklyn Data Co.

By BROOKLYN DATA CO.

NOVEMBER 23, 2020

dbt, Technology

6 things we love about dbt

The “data build tool” known as dbt has made pretty major waves in the analytics and data space. In short, dbt is a full-stack data tool that allows data analysts and analytics engineers to own the “transform” step in the Extract-Load-Transform (ELT) pipeline. It’s also open source and has developed a thriving community around it. At Brooklyn Data Co., we consider it a regular fixture in the modern data stack and frequently recommend dbt to our clients.

We asked some of the Brooklyn Data Co. team what they love most about dbt, and there were a lot of shared sentiments. Listen to their conversation by watching the video here, or check out a summary below. You’ll hear from two of our analytics engineers, Raphaela “Raphi” Abramson (she/her) and Eli Kastelein (he/him), and two senior data analysts, David Lai (they/them) and Elizabeth Martens (she/her).

1. Tests

What they are

dbt allows us to transform our data and incorporate business and organizational logic into the flow of the data pipeline. This transformation functionality is nicely complemented by dbt’s built-in data quality tests. Tests are declared in YAML files, alongside field descriptions, and each one encodes an assumption about the data (for example, that a column is never null). If the data violates one of those assumptions, dbt flags the failure, making problems easier to diagnose.

https://docs.getdbt.com/docs/building-a-dbt-project/tests/

What’s great about them

Businesses always want their data to be “high quality,” but often don’t have a clear definition of what quality is. Enter: dbt tests. Testing in dbt is fantastic when it comes to sussing out what data consumers actually think their data should look like. Elizabeth explains, “By creating these really specific tests and nailing down what the edge cases are, it creates a forcing function for those businesses to have conversations they wouldn't otherwise have about what quality data means.”

In general, dbt tests make it much easier to standardize across an org. Raphi recalls a previous job where the tests she wrote worked just fine for her but weren’t documented anywhere else, because there was no system in place for that. What are the implications of this, exactly? Well, nobody stays at a job forever; because Raphi’s work existed in a vacuum, it was very difficult for other people to take over her job without a substantial amount of onboarding. With dbt testing, you can avoid that. Assuming the model is structured correctly and has proper test coverage, it can simply be handed over to anyone who needs it.

dbt tests are also a great way to check our assumptions—something critical when working in data. And they make it easy: as Eli points out, you can run thousands of tests in a project with just a few lines of code for each one and no disruption to your model. They can run the gamut from a simple null test to using custom data to test complicated logic that's very specific to the business.
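To make the idea concrete, here is a minimal Python sketch of what a data test boils down to conceptually: a query that selects the rows violating an assumption, where zero rows means the test passed. This is not dbt's implementation. The helper functions, the `customers` table, and the in-memory SQLite database are all assumptions made for illustration; only the test names `not_null` and `unique` come from dbt itself.

```python
import sqlite3


def not_null_test(table, column):
    # Selects every row where the column is unexpectedly NULL.
    return f"SELECT * FROM {table} WHERE {column} IS NULL"


def unique_test(table, column):
    # Selects every value that appears more than once in the column.
    return (
        f"SELECT {column}, COUNT(*) AS n FROM {table} "
        f"WHERE {column} IS NOT NULL GROUP BY {column} HAVING COUNT(*) > 1"
    )


def run_tests(conn, tests):
    """Return {test_name: number_of_failing_rows}; zero means the test passed."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in tests.items()}


# Hypothetical sample data: one missing email and one duplicated customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)

results = run_tests(conn, {
    "not_null_customers_customer_id": not_null_test("customers", "customer_id"),
    "not_null_customers_email": not_null_test("customers", "email"),
    "unique_customers_customer_id": unique_test("customers", "customer_id"),
})

for name, failures in results.items():
    print(name, "PASS" if failures == 0 else f"FAIL ({failures} failing rows)")
```

Because each test is only a few lines, adding another assumption is cheap, which is the property Eli describes.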

2. Version control

What it is

Version control is made seamless and easy in dbt. Elizabeth likens it to writing a paper in college: you often have several versions: V1, V1.2, V2, V3, V “My mom looked at this one,” V4 final, V4 final final, etc. Version control in dbt is essentially a much better way of doing this. It’s managed through git, which makes collaborating on a single code base incredibly easy (a big plus for teams who need to work closely together).

What's great about it

Along with easy collaboration, managing version control in git lets you push something out for review and, in tandem, build out another feature of your model. As David explains, it creates a more modular system for your work, giving peace of mind, especially for large, challenging analytics problems with complicated logic. If you make a mistake, you can easily go back and pick up from a different version, and even learn from what you did wrong. It also enables you to approach these problems at a smaller scale. “What version control helps me do personally is sort of get out of that rabbit hole and just think about one incremental step at a time, while also allowing other folks to help chip in when needed.”

Raphi points out how version control naturally prevents people from indulging in their worst data instincts (like creating a million edge case statements), encourages centralizing data, and just generally makes us smarter about how we work. “You can't create Frankenstein tables unless everyone else on your team is okay with Frankenstein tables. It's almost like a form of checks and balances because you have your stuff peer reviewed, so you can't be lazy.”

Eli agrees that the built-in git workflow is great, highlighting the comfort in knowing that every branch has been peer reviewed and you have a master copy that can be trusted.

3. DAGs and dbt docs

What they are

dbt docs enable you to generate documentation about your project so anyone looking in can understand the models and their dependencies. A major part of this is the set of graphs that dbt generates to show the model dependencies, known as directed acyclic graphs or “DAGs.” dbt builds these automatically and shows you all of the dependencies in your project. It’s a very visual way of knowing every single model that you have, and is especially useful as your project gets bigger and the number of models increases.

https://docs.getdbt.com/docs/introduction/

What’s great about them

Eli explains that without the DAGs, it can be tough to see how things relate to each other, or where upstream dependencies are. “It's so helpful, because it really shows what's going on in your project. And it's helpful from a debugging perspective for someone who's developing models to see.” The docs really come in handy when it comes to working with non-technical stakeholders: rather than having to point a stakeholder to a YAML file they don’t understand, dbt docs creates a nice web interface that goes hand in hand with the columns the data team has defined.

Ultimately, having everything in dbt docs and being able to leverage the DAGs creates a level of transparency for anyone who wants to look at your data. Elizabeth elaborates more on why this is especially great with non-technical stakeholders: “A COO or CEO has an intimate knowledge of what they think the data should look like or how they're presenting it to investors. And so being able to have them look directly at how the data definitions are being defined by the data team I think is huge for ironing out any miscommunications that might be happening from the very hands-on data side of the organization all the way up through the top.”

Finally, it’s important to note that the DAGs are always automatically updated and maintained. Nobody has to spend time manually maintaining a chart, a process that leaves room for error and tends to produce something that is neither visually appealing nor useful.
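One way to picture why the graph needs no manual upkeep: the dependencies can be read straight out of the model code. The Python sketch below scans each model's SQL for `ref()` calls (the function dbt models use to point at other models) and derives a valid build order with the standard library's `graphlib`. It is a simplified illustration rather than dbt's own parser; the model names and SQL are hypothetical, and a regular expression stands in for real Jinja parsing.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def find_refs(sql):
    """Return the set of model names this SQL depends on."""
    return set(REF_PATTERN.findall(sql))


def build_order(models):
    """Order models so that every model comes after the models it references."""
    graph = {name: find_refs(sql) for name, sql in models.items()}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical project: two staging models feed a join, which feeds a report.
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "customer_orders": (
        "select * from {{ ref('stg_orders') }} "
        "join {{ ref('stg_customers') }} using (customer_id)"
    ),
    "revenue_report": "select * from {{ ref('customer_orders') }}",
}

order = build_order(models)
print(order)
```

Adding or removing a `ref()` in any model changes the graph the next time it is computed, so there is no separate chart to keep in sync.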

4. Environment management

What it is

dbt allows users to manage their own environment, meaning that you essentially can work in your own sandbox and know with certainty that you’re not inadvertently working on the production data model or in someone else’s schema.

What’s great about it

Raphi thinks this is one of the best parts of dbt: “I love that when I'm working with dbt, I know what the production-level data is. I can always access it depending on what I run and what I'm looking at.” Raphi and Eli both recall the trouble of working without environment management in previous jobs: not knowing where the data came from, having multiple versions of the same data, or trying something out only to have someone else adopt it, so that you now have to maintain and update it (which was never your intention). These are all challenges that environment management eliminates. It also ensures that everyone is getting the same results: no matter what environment you’re using, you’re all looking at the same raw data.
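The sandbox idea can be sketched in a few lines of Python. In this illustration, attached in-memory SQLite databases stand in for warehouse schemas: every environment reads the same raw table, but each build writes only to its own schema. The schema names `prod` and `dev_raphi`, the `order_totals` model, and the `build_model` helper are all hypothetical, and dbt's actual configuration works differently.

```python
import sqlite3


def build_model(conn, schema, model_name, select_sql):
    """Materialize a model as a table inside the given schema only."""
    conn.execute(f"DROP TABLE IF EXISTS {schema}.{model_name}")
    conn.execute(f"CREATE TABLE {schema}.{model_name} AS {select_sql}")


conn = sqlite3.connect(":memory:")
# Attached databases play the role of separate schemas (hypothetical names).
conn.execute("ATTACH DATABASE ':memory:' AS prod")
conn.execute("ATTACH DATABASE ':memory:' AS dev_raphi")

# Shared raw data that every environment reads from.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)])

# The production version of the model.
build_model(conn, "prod", "order_totals",
            "SELECT SUM(amount) AS total FROM main.raw_orders")

# An experimental change, built only in the developer's sandbox.
build_model(conn, "dev_raphi", "order_totals",
            "SELECT SUM(amount) * 2 AS total FROM main.raw_orders")

prod_total = conn.execute("SELECT total FROM prod.order_totals").fetchone()[0]
dev_total = conn.execute("SELECT total FROM dev_raphi.order_totals").fetchone()[0]
print("prod:", prod_total, "dev:", dev_total)
```

The experiment in `dev_raphi` never touches `prod.order_totals`, so nobody can accidentally adopt a half-finished table as the production one.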

Additionally, because the interface between dbt and git allows you to run things locally, environment management gives you the ability to move back and forth and work on various features (e.g., code reviews for other people, running other people's code locally). It makes troubleshooting easy. If someone is having an issue, you can pull their code into your own environment and run it locally without compromising anything.

Elizabeth also points to how environment management can help develop younger analysts. It provides a framework for them to keep their own changes and features separate instead of being in a silo and building things around only what they know. It enables them to get used to the development cycle of making a branch and to learn without impacting production: “Making their changes, testing their changes, merging, and then moving onto the next thing—I think it is really helpful that way.”

5. Ease of adoption and use

Many folks, including Raphi, were surprised by how easy it was to get used to dbt: “It's basically SQL, with some Jinja sprinkled in. I had never used Jinja before this, and I tend to be pretty cautious when I'm starting off new things, but after a few hours of using it, it felt very intuitive.”

The kind of work dbt does is often something people try to do manually before adopting it. Elizabeth recalls, “When I first was learning dbt when I came to BDC, it was sort of like, ‘Oh, I have tried to do this with tools that weren't designed specifically for data modeling via linking scripts together with homemade dependencies and a Java system of some sort.' And dbt is just a lot slicker than that. I don't have to make it myself or maintain it, which is thrilling. The fact that it's open source is great, because if you want to add a feature or something like that, you can absolutely do so. And also it's just made very robust, much more so than any system that I or a company that I worked with had internally created out of necessity. It's designed to do this thing and it does it quite well.”

David and Eli agree that the level of transparency dbt provides certainly helps with the learning curve. When David started learning dbt, they had only been using SQL for about 3 months. “What I loved about dbt was that it really was a file, an organizational structure to say: ‘Here's the model, here's where it's referring to.' And so I think, going into a relational database, it really helps you just to figure out how you want to create this world.” Eli loves that, unlike some other data tools, dbt is not a black box: there’s no mystery about what’s going on under the hood. You know that it's simply running SQL statements against your database, you can see the exact statements it’s running, and you can see all of the compiled code. There’s no secret when something’s not working; you just have to look at the code.
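As a rough illustration of what "compiled code" means here, the Python sketch below turns a model written as SQL with a Jinja-style `ref()` into the plain SQL statement that would actually be sent to the database. It is a simplified stand-in: a regular expression replaces real Jinja rendering, and the `analytics` schema, the `stg_orders` model, and the `compile_model` helper are hypothetical.

```python
import re

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def compile_model(sql, schema):
    """Replace each {{ ref('model') }} with a schema-qualified table name."""
    return REF_PATTERN.sub(lambda match: f"{schema}.{match.group(1)}", sql)


# What the analyst writes: SQL with a little templating sprinkled in.
source = (
    "select customer_id, sum(amount) as total\n"
    "from {{ ref('stg_orders') }}\n"
    "group by 1"
)

# What the database receives: ordinary SQL with nothing hidden.
compiled = compile_model(source, schema="analytics")
print(compiled)
```

When a model misbehaves, reading the compiled statement is usually the quickest way to see what actually ran.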

6. The Community

What it is

dbt is incredibly supportive of the analytics community. They run an active and ever-growing Slack group of over 8,000 people and 50+ channels, host events to support the data community, take a strong stance on best practices in analytics, and are passionate about open source.

What’s great about it

If you’ve made it this far, you’re probably sensing a theme across some of dbt’s best features: they optimize for collaboration. It’s only natural, then, that dbt would create an equally collaborative community outside of dbt. David speaks to the value of the Slack group: “I remember when I first onboarded onto dbt, a lot of times I would get a little confused by the documentation, or didn't really know where to go. I would just shoot a quick message in a Slack channel, and immediately I would get so many responses of, ‘Oh, have you tried this? Have you tried that?’ Almost like a mini dynamic Stack Overflow, but in Slack.” Eli echoes these sentiments: “I feel like in the data world, a lot of people are doing the exact same thing and having the exact same problems, but at different companies, and there wasn’t really a place for them to talk about those things until this Slack community popped up. And I think that's important because data teams are often pretty small.”

dbt also offers frequent Office Hours as another avenue to support and troubleshoot with the data and analytics community. Elizabeth shares her experience with Office Hours: “People get up and present dbt projects that they've worked on, and they're super cool. They're also recorded. So you can go to their YouTube channel and check them out. I’d actually been watching an Office Hours, and they had mentioned another Office Hours that applied to a problem I was working on, and was able to go back and look at that.”

To sum up...

We’re pretty big fans of how dbt impacts and enriches our daily work. If you’re new to dbt and wondering how it might integrate into your data stack, here are some things it doesn’t do:

It’s not a magic code writer

You can still write bad code with dbt. It has a lot of guard rails in place to prevent that, but you still need to establish boundaries. Elizabeth compares it with bowling: “I have in fact bowled with bumpers and still gotten a gutter ball. I am uniquely talented in that way. And I think that's sort of the same thing with dbt, so it's still really important for an organization or even just a one-person data team to put their flag in the sand and say ‘Hey, these are the standards that we're going to maintain.'”

It’s not made for the “E” and “L” part of ELT

dbt is great at transformation, because that’s what it's made for. But it doesn’t solve for extracting and loading, and it's still important to invest in a tool that will solve that for you (e.g., Fivetran or Stitch).

It helps you architect, but it won’t architect for you

It’s critical to establish your own ground rules before modeling in dbt, or you’ll end up with a mess. You also still have to put in the work to model: as David says, “Sometimes I'm like, ‘Oh, I just wish this model could be done.' You know? I just wish we could solve that problem. And dbt doesn't give you this grant. So although sometimes development can feel very insular and just sort of in your own head, hopefully what dbt can help you do is give you access to that community of other people who are thinking about the same thing, so that it can feel a little bit more collaborative and out of your head.”

Check out our GitHub repo, where you can find our dbt style guide, SQL style guide, and PR template.

Want to chat with us about how you can be leveraging dbt in the modern data stack at your company? Drop us a line: hello@brooklyndata.co.
