Aaron Carmichael | InterWorks
https://interworks.com/people/aaron-carmichael/

How to Configure a Local dbt Core Environment
https://interworks.com/blog/2023/10/11/how-to-configure-a-local-dbt-core-environment/
Wed, 11 Oct 2023


Setting Up a Personal dbt Instance

Over the past few weeks, I have been diving into dbt in my free time to learn more about the tool, as well as about how data is prepared before it normally reaches my hands in Tableau. If you’re unfamiliar with dbt, check out this introduction from our very own Paul Vincent. To dive right in, I started by creating trial accounts with both Snowflake and dbt Cloud. However, in order to explore some administrative aspects of dbt, and to experience building a full environment from the ground up, I wound up creating my own instance locally on my laptop as well. I want to share how I went about this and encourage anybody who is interested to give it a try, as this is an easy (and free) way to learn.

Considering that dbt offers free trials, you might wonder why you would want to do this. There are a couple of reasons:

  • dbt comes in two versions: dbt Core and dbt Cloud. The free trial is done entirely through dbt Cloud alongside a cloud data warehouse of your choosing. However, given that dbt is open source, there are ways to use it outside of the cloud environment, and setting up a personal environment is one of the best ways to explore this. While dbt Core and dbt Cloud are based on the same product, dbt Cloud offers a few quality-of-life improvements that are not immediately available in dbt Core.
    dbt Cloud:
    dbt Core in VS Code:
  • While the free trial of dbt Cloud does not have a time limit, my accompanying trial of Snowflake did. With the methods I describe below, you are free to explore for all the time you need.

Create Your Data Source

First, you will need an underlying data source that dbt can use. Fortunately, dbt is not restricted to cloud data warehouses and can connect to many different local data sources, including PostgreSQL, MySQL and SQLite.

Note: There are some differences in dbt depending on the underlying data source. Whichever one you use, I highly suggest reading the documentation to learn about these differences.

For this tutorial, I chose PostgreSQL, as it is the one I am most familiar with. After downloading it and running the installer, a server was created locally on my machine, and I created a superuser to essentially act as the DBA. Feel free to test dbt alongside various permission settings within Postgres. After installing and configuring Postgres, you will need six pre-existing tables to hold your raw data, mirroring the Jaffle Shop schema (note: the actual data itself will be generated in a later step).
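Since the schema diagram isn’t reproduced here, below is a rough sketch of what those tables could look like in Postgres. The table and column names are my assumptions based on the Jaffle Shop sample data, so double-check them against the header rows of the CSVs you’ll generate in a later step:

CREATE TABLE raw_customers (
    id text,
    name text
);

CREATE TABLE raw_orders (
    id text,
    customer text,
    ordered_at timestamp,
    store_id text,
    subtotal numeric,
    tax_paid numeric,
    order_total numeric
);

CREATE TABLE raw_items (
    id text,
    order_id text,
    sku text
);

-- raw_products, raw_stores and raw_supplies follow the same pattern;
-- match each column to the corresponding CSV header.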

Set Up an IDE

Next, you will need an IDE of your choice to write the SQL that powers dbt. I personally use VS Code, as it has some extensions that can greatly improve quality of life while writing SQL. These extensions include vscode-dbt, dbt Power User and Better Jinja. However, you may use a different IDE as well.

Additionally, you will need Python. At the time of writing this blog, dbt supports Python 3.7-3.10. After installing Python, it is recommended to create a dedicated environment specifically for dbt, which can be accomplished with something like venv. After activating your virtual environment, you can begin installing dbt.

Note: The specific package you install will depend on your underlying database. In my instance, I installed dbt-postgres via pip install dbt-postgres.
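For reference, that whole step boiled down to something like this on my Windows machine (the environment name is arbitrary; on macOS or Linux, the activation command is source dbt-env/bin/activate instead):

python -m venv dbt-env
dbt-env\Scripts\activate
pip install dbt-postgres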

At this point, I recommend creating a dedicated directory anywhere on your machine to hold your project. Within that directory, you can simply run dbt init, and dbt will automatically generate the necessary file structure where you can store your models and .yml files. Note: Installing dbt via pip will allow dbt commands to be used anywhere on your machine, i.e., you no longer need to be in your dedicated Python environment when using dbt.

As an example, I’ve created a brand-new directory that contains nothing. After opening it in VS Code and running dbt init in the terminal, it is now populated with the necessary folders for a dbt project named “test_project.” You will also have the option to set the type of database that you will be using, which will be important in a later step.

Before:

After:
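In case the screenshots don’t come through, the scaffold that dbt init generates looks roughly like this (the exact contents can vary slightly between dbt versions):

test_project/
    analyses/
    macros/
    models/
        example/
    seeds/
    snapshots/
    tests/
    dbt_project.yml
    README.md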

Generate the Data

We’re almost ready! However, what good are data-related tools without data? Fortunately, the dbt Labs team has created a utility called jafgen that can automatically generate CSV files based on their classic Jaffle Shop example used in many other dbt training materials. Running these two commands will install jafgen in your environment and create three years’ worth of example data:

pip install jafgen
jafgen --years 3

These CSVs can be loaded into the tables you created earlier via the data import tool within the pgAdmin application (installed alongside the Postgres server earlier) or via any other method you would like to use, such as the psql sketch below.
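If you would rather script the load than click through pgAdmin, psql’s \copy command is one option. A minimal sketch, assuming jafgen wrote its CSVs to a jaffle-data folder and that your table names match the files (paths and names may differ on your machine and jafgen version):

\copy raw_customers FROM 'jaffle-data/raw_customers.csv' CSV HEADER
\copy raw_orders FROM 'jaffle-data/raw_orders.csv' CSV HEADER
\copy raw_items FROM 'jaffle-data/raw_items.csv' CSV HEADER

...and so on for the remaining files.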

Connect dbt to Your Data Source

Lastly, before we can test that everything is working, we need to tell dbt how to connect to our data. When we initialized our project via dbt init earlier and chose the underlying data source, dbt created an important file called profiles.yml. By default, this is located in the .dbt folder of your home directory (for example: C:\Users\acarmichael\.dbt). This file is where you can define DEV and PROD environments, as well as connection details and credentials for your database. You can find the details on profiles.yml here. After configuring your connection profile, you can edit your dbt_project.yml to use that profile.

Newly created profile:

Example profile, filled out (use your own username/password):
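If the screenshot doesn’t come through, a filled-out profile for a local Postgres target generally looks like the following. The profile name, database, schema and credentials here are placeholders from my example, so substitute your own:

test_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: your_username
      password: your_password
      dbname: jaffle_shop
      schema: dev
      threads: 4

Whatever you name the top-level key here is what your dbt_project.yml should reference in its profile setting.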

Finally, the time has come. You can start by running dbt debug to make sure dbt can connect to your data properly. If all checks pass, then you’re good to go, and you may start developing your DAG and exploring to your heart’s content.
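From there, the typical first loop uses a few standard dbt commands: dbt debug to verify the connection, dbt run to build the models in your project and dbt test to run any tests you have defined.

dbt debug
dbt run
dbt test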

Filtering in Tableau While Keeping a Specific Category
https://interworks.com/blog/2023/03/27/filtering-in-tableau-while-keeping-a-specific-category/
Tue, 28 Mar 2023


Check out the workaround described here on Tableau Public: Tableau Filtering – Keep Null Out of List | Tableau Public

TL;DR – With this workaround in Tableau, you can always include a category in a viz without including it as an option in a filter, making it visible regardless of what the end user selects.

The Workaround

I recently had a client request along these lines that had me scratching my head: “How can you configure a filter in Tableau to always display a certain category (for example, Null) without listing it as an option for users to exclude?” Depending on your dashboard audience, the word “null” can be ambiguous and carry no clear meaning. If there were a way to exclude it from the filtering list while ensuring that null records were still displayed, we could keep end users happy while retaining the data in our view.

To recreate this scenario, I found a dataset containing all Dunkin’ Donuts locations in the US. For this example, I related this data to a census dataset showing population by county throughout the United States. Minor data preparation was performed in Dataiku.

Population data prep shown in Dataiku

In this hypothetical scenario, the end users of this data would like to see a map of all Dunkin’ Donuts locations and filter on whether or not the store also contains a Baskin Robbins. This should be displayed on top of all US counties colored by population. At first glance, it’s a simple request. We can construct a map using separate layers for counties and for stores, and then filter on whether or not the location has a Baskin Robbins:

Map data in Tableau showing Dunkin Donuts location data in US

Notice that our filter has an option for “Null.” Because of the structure of our data, the two datasets are essentially being full outer joined. As a result, the records in the final dataset for counties that do not contain a Dunkin’ Donuts store will have a “Null” record for whether or not the store contains a Baskin Robbins, since there is no store in the first place. If we exclude Null records, we can see this on the map:

Map data in Tableau showing Dunkin Donuts location data in US minus counties without a DD

We’ve lost all data for counties that don’t have a Dunkin’ Donuts. The easiest solution would be to tell our users not to exclude null records, but that could be a hard rule to enforce depending on the size and technical aptitude of the dashboard audience. The better solution is to always include null records without displaying them as an option in the filter. Here’s how we can accomplish that.

  1. Create a calculation that recodes null records to have a value that already exists in the data. In this case, I am recoding null records to be true:

Recoding null records as true in Dataiku

*NOTE: You may be asking why we couldn’t simply stop here. After all, we no longer have null records in the data. However, if end users now filter out TRUE records, you will lose the records that were originally null, when we want to keep them regardless of how the filter is configured.

  2. Create a set based on your new list calculation:

Editing HBR List Set

  3. Create a filter calculation:

Creating a filter calculation for HBR Filter

Use this calculation as a filter, set to TRUE, on your view. From here, use the set created in step 2 as your filter:

Map filtering DD locations and US population, creating list of counties without a DD
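If you want to recreate something similar, the pieces might look roughly like the sketch below. The field, calculation and set names are placeholders from my example, and the exact expressions in the screenshots above may differ:

// Step 1: recode calculation ("HBR List"), folding nulls into a value that already exists
// (assumes the field is boolean; if it is text, recode to an existing string value instead)
IFNULL([Has Baskin Robbins?], TRUE)

// Step 3: filter calculation ("HBR Filter"); originally-null records always pass,
// while everything else follows the set created in step 2
ISNULL([Has Baskin Robbins?]) OR [HBR List Set]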

We can now safely filter on the status of whether or not a store has a Baskin Robbins location and not worry about losing all other data.

You probably noticed that we are now missing data for some, but not all, counties. The workaround demonstrated here is not a perfect solution and has drawbacks. In this case, we now lose county information for counties that do contain stores, as opposed to the other way around. Your mileage with this workaround may vary. If you want us to take a look at your dashboards and provide advice or other workarounds, feel free to reach out and see what we can do for you.

Exploring the SQL Pipeline Features in Dataiku
https://interworks.com/blog/2022/10/28/exploring-the-sql-pipeline-features-in-dataiku/
Fri, 28 Oct 2022


In the modern analytics stack, data is constantly on the move. While all data originates somewhere, it typically must go somewhere else before it becomes valuable for data analysts, engineers and scientists. However, when working with pipelines that deal with large amounts of data, it is important to construct them to work as efficiently as possible.

A well-designed pipeline can be the difference between having your data ready and available in an hour vs. a minute. Luckily, one of Dataiku’s best features is the built-in ability to orchestrate and optimize your data pipelines. How, you may ask? Here, we’ll be exploring a couple of methods to do exactly that.

Getting Started

To start, it’s important to make sure that you are using the best computation engine for the job. Remember: the computation engine is a feature of a recipe as opposed to a full Dataiku project, so this is a feature that can be configured multiple times per project. To change computation engines, double-click on a recipe and then click on the gears next to the text under the run button. From there, you will see a list of engines available based on where the data is located. You can see in the image below that Dataiku will place a star by the computation engine that it recommends:

Dataiku Settings Button

Dataiku Recipe Engine Options

Types of Computation Engines in Dataiku

While there are several different computation engines, I’ll group them as follows:

DSS Built-in Computation Engine:

This option will always be available to you, no matter what data you are using or where it comes from. All reading, writing and calculations are performed by your Dataiku instance. However, unless the specs on your Dataiku server are out of this world, this will typically yield the slowest results. This is also the default option when constructing your Dataiku flows. Lastly, this option brings a copy of your data into your Dataiku instance, so be mindful of the volume of data that will be processed.

Using your underlying database or other infrastructure:

Depending on the location of your data as well as your available infrastructure, you may have more options for your computation engine that can drastically increase the efficiency of your data pipeline. For example, if you are working with data that lives in a Snowflake database, you can push calculations and transformations to your Snowflake instance rather than your Dataiku server. Additionally, if you have an available instance of Apache Spark, Hive or Impala, you can also utilize these technologies for your pipelines. In almost every situation, this will yield a more efficient and quicker pipeline. However, it is important to keep external factors in mind. For example, Snowflake credit usage should be considered. Make sure to follow best practices for all technologies involved outside of Dataiku.

Note: Not every function within Dataiku can be pushed to alternate computation engines. For instance, various operations in a prepare recipe cannot be converted to SQL, so you will be unable to use your database for computation. To work around this, try to group calculations that are SQL-compatible away from SQL-incompatible operations to maximize your database usage. To see more about which recipes can utilize this capability, see the chart here.

SQL Pipeline Optimization

When building a flow in Dataiku, every recipe will output a dataset. If you are an analyst performing ad-hoc transformations to explore your data, then this is ideal. However, if the end goal is to create the most efficient pipeline possible, then this results in unnecessary reading and writing of datasets. Within a Dataiku flow, multiple consecutive SQL recipes or SQL-compatible visual recipes can be configured to skip the writing of intermediate datasets by using the experimental “SQL pipeline” feature. A similar feature exists for Apache Spark; however, we will be focusing on SQL pipelines for now. With this feature enabled, multiple consecutive recipes that use the same SQL computation engine skip the intermediate reading and writing of datasets: your initial dataset goes in, and your final dataset is written, with nothing materialized in between. Enabling this feature is easy.

Enabling the SQL Pipeline

1. Select each recipe that will be a part of your pipeline by holding “ctrl” and selecting each one individually in your workflow view. Within the “Actions” section of the pane that opens to the right, under “Other Actions,” there is an option to “Change SQL pipelineability.” Within this menu, tick the boxes to allow those recipes to be part of a SQL pipeline:

Dataiku New Pipeline Options

Dataiku Set SQL Pipelineability Options

2. Despite not being built by the SQL pipeline, the intermediate datasets will still appear in the flow. Select all intermediate datasets. As in the previous step, under “Other actions,” there’s an option to “Allow build virtualization (for pipelines).” Make sure this is selected for all intermediate datasets:

Dataiku Allowing Build Virtualization

Testing it Out

To compare these different methods, I’ve constructed a single Dataiku workflow that performs the same data transformations three ways: using the DSS computation engine, pushing calculations to Snowflake, and using the SQL pipeline feature with Snowflake. Pictured below is the full flow:

Dataiku SQL Pipeline Full Flow

This experiment uses FAA data tracking flights from all over the US across many years. In the section labeled “Prep,” I have filtered the full FAA dataset down to a specific set of airlines that took off in a specific group of states. Then, the data is joined and cleaned using the computation engines discussed above. At the end of the flow, the datasets are just shy of 3 million rows. By using different views of the Dataiku flow, we can confirm that one branch uses the DSS engine and two use the Snowflake engine. We can also confirm that one of the branches using Snowflake is a SQL pipeline:

Pipeline Types in Dataiku Flow

Identifying SQL Pipelines in Dataiku Flow

I’ve created a scenario that runs all three sections of the flow in succession, and we can look at the time needed to finish these jobs as a comparison point:

Dataiku Comparison Scenario

Looking at the runtime for each branch, we can see that using only the built-in DSS engine results in significantly longer runtimes. If we were to scale this to enterprise-level pipelines that handle larger quantities of data, we could potentially see flows that take hours to run.

Consider this: let’s say your BI department sets out to build dashboards in Tableau that reflect the most recent data, and these dashboards depend on data sources built via Dataiku flows. Looking at the result of this experiment, we know that if the Dataiku flow were not optimized, those dashboards would not be updated as often as requested.

We’re Here to Help

If you are interested in Dataiku or have recently purchased it, you should know that InterWorks can help with adoption and enablement through trainings and workshops, among other offerings. If you’d like to know more, feel free to contact us! You can also try out a 14-day free trial of Dataiku here.

AI in NYC: InterWorks at Dataiku’s Everyday AI Conference
https://interworks.com/blog/2022/07/27/interworks-inaugural-visit-to-everydayai-by-dataiku/
Wed, 27 Jul 2022


New York City: the city that never sleeps. It’s the home of Wall Street, the Empire State Building, limitless talent (shoutout to the cast of Hadestown) and so much more.

At the end of this past June, I, alongside a team of my fellow InterWorkers, had the chance to head to the Big Apple to attend the Everyday AI Conference, hosted by Dataiku, where we got to learn about what’s on the horizon for both Dataiku as a tool and the artificial intelligence / machine learning space as a whole.

InterWorks Consultants at the Everyday AI Conference (Interior)

Group Picture of InterWorks Consultants Outside at the Everyday AI Conference

Above: Members of the InterWorks crew sent to NYC.

Hot Off the Presses: Dataiku 11

For those unfamiliar with Dataiku, it is essentially a tool that combines the powers of last-mile data prep with data science capabilities. Rather than spending hours writing custom Python code to perfect a single model, you can leverage Dataiku’s AutoML features to train several models and tune hyperparameters in just a few clicks. You can take a deeper dive and find out more in this blog by InterWorks’ resident Dataiku Neuron, Rachel Kurtz, who also spoke at Everyday AI.

Dataiku is continuing to add more features to their product, with some exciting new ones coming soon with Version 11. For a deep dive into the new features of Dataiku 11, check out Dataiku’s presentation here. While definitely not an exhaustive list, here are some of the ones we’re most excited about:

  • Visual time series forecasting – Until this version, Dataiku only supported predictive models and not forecasting models. You would find options for your standard logistic regression or random forest models, but not ARIMA models, for example. Forecasters rejoice, as you will now see time series forecasting as an option in the lab next to the other predictive model options.
  • Visual deep learning – Image classification has been added to the lab under visual analysis, alongside object detection. Pre-built models have been included to save time on manual model development, and data augmentation allows users to generate more training/testing data by altering existing images (e.g., rotating/cropping images, flipping them, altering the colors, etc.).
  • Centralized feature store – Dataiku has always been about collaboration among individuals as well as across an organization. This centralized repository can store curated datasets that can then be pulled into other projects, minimizing the time spent manually sharing these same assets.
  • Flow documentation generator – As we all (hopefully) know, you should comment your code. Dataiku’s version of commenting code (besides actually commenting any written code) is creating assets such as project wikis that contain any information you might need to understand a project. With this new generator, you can instantly create a document that includes a screenshot of the project flow and a summary of all datasets, recipes and folders. Features like this can save a great deal of time on a task that is often seen as mundane and tedious. Users can also provide custom documentation templates personalized to their organization. Lastly, this can be included as an automated task in a project, so documentation can be self-updating.
  • Model stress tests – How would your model behave if something in the input data (e.g., the distribution of a feature) were to drastically change? Sure, we could programmatically create our own stress tests. On the other hand, we can now use Dataiku 11’s built-in stress tests, which allow you to create these tests much more quickly and efficiently.
  • Expanded visual logic – Whether it’s in Dataiku or Microsoft Excel, most of us have, at one point or another, crammed a long, convoluted string of if/else statements into a single formula. While doable, this obfuscates the purpose of the formula and makes it more prone to errors. In Dataiku 11, prepare recipes, as well as other filter locations, include a new UX for developing nested if/else statements to remedy these issues. Additionally, the Switch() function has been added for more complex rules, both in written formulas and as a processor in prepare recipes.

InterWorks Consultants at the IW Booth During Everyday AI

Above: InterWorkers manning our booth in between presentations.

What Lies in Store for AI/ML in the Future

While we loved learning about all the cool new tech coming in Dataiku 11, there was plenty of discussion surrounding high-level concepts and visions for the future. One of the first presentations on the second day of the conference came from Florian Douetteau, CEO of Dataiku, who reminded me of something important and presented an idea for how we can build a world of self-service analytics that includes predictive elements.

First, when we talk about the use of AI in corporations, it’s important to remember that this is not necessarily referring to deep learning or complex neural networks. The term AI, at the end of the day, is a buzzword. As a buzzword, it is commonly associated with the complex, sexy solutions that people think are awesome. However, a linear regression model is just as much AI as a convolutional neural network.

To corporations newer to utilizing AI in their day-to-day operations: Do not be scared by the thought of needing subject-matter expertise in ultra-complex solutions. We all have to start somewhere, after all. Florian also discussed how companies can structure themselves to best implement Dataiku and predictive analytics in their self-service ecosystems.

According to him, companies should create “analytics tribes,” where a small group, or groups, can combine domain knowledge and tech skills. These groups would be your primary Dataiku designers, as they can then act as liaisons between business users that know the domain but not the technology and tech experts that know the programs but not the domain.

Our First Neuron, Our First Speaker

Our very own Rachel Kurtz gave a presentation alongside Nicole Alexander, Head of Marketing at Meta, about how we move towards a world fully driven by analytics. In fact, their proposal is that we should work towards a grand total of one billion knowledge workers. However, how exactly do we get there?

Rachel Kurtz Presenting at Everyday AI by Dataiku

Above: Rachel Kurtz on stage for her presentation.

A common problem we see today is that working in analytics and data science is seen as unattainable to most people, with jobs requiring master’s degrees, if not doctorates. To combat this, both Kurtz and Alexander suggest that, in the future, data science and analytical knowledge will become a skillset rather than a profession. Rather than seeking a job solely as a data scientist, one can seek a job in a given vertical that includes data science work.

By eliminating this barrier, the world of data becomes infinitely more accessible to countless people, allowing all sorts of people to contribute. They also gave advice to companies to prioritize their current employees and foster any interest they may have in data science, being sure to include professional development time that can be used to progress along their data science journey.

But how do we start working towards one billion knowledge workers? During the conference, there was no shortage of companies detailing how they were using Dataiku to strengthen their analytics ecosystems. We heard stories from companies such as Pfizer, Ralph Lauren, Boeing and more about their digital transformation journeys towards self-service environments. Just about every company talked about how the collaborative capabilities of Dataiku broke down existing silos and allowed them to democratize their analytics.

However, one thing that was mentioned several times was that many companies held internal trainings, and that these trainings were a key part of their adoption of Dataiku. For companies that have either already invested in Dataiku or are considering it, this is where we at InterWorks can help. We have people across the globe licensed to deliver Dataiku trainings, and we can assist in your journey of Dataiku adoption and towards self-service analytics. First up, if you haven’t already, give Dataiku a try with a 14-day free trial.
