Azucena Coronel | InterWorks
https://interworks.com/people/azucena-coronel/

ML in Snowflake with Snowpark for Python

https://interworks.com/blog/2023/10/17/ml-in-snowflake-with-snowpark-for-python/ | Tue, 17 Oct 2023

Earlier this year, Snowflake announced the availability of Snowpark ML during its Snowflake Summit. Snowpark ML is a set of tools, including SDKs and underlying infrastructure, for building and deploying machine learning models, all within Snowflake. It works with Snowpark for Python, so you can use Snowpark DataFrames to hold your data, manipulate it and receive your prediction results.

It has been a hot topic, and for good reason! For one, you can now transform your data and train your models without the data ever leaving Snowflake. Also, because the APIs are based on familiar packages such as scikit-learn, it is pretty easy to get started. And, of course, there is the advantage of being able to leverage Snowflake's performance and scalability: no more "ramping up our VM so that this thing can run" (yes, I've been there).

Of course, I couldn’t wait to get my hands on it and test some of its features. I loved the clear and easy steps to go from data to predictions, all using Snowflake power.

Snowpark ML has two key components:

1. Snowpark ML Development (Currently in Public Preview)

The package snowflake.ml.modeling provides libraries for data processing, feature engineering and model training. These libraries are based on familiar ML libraries such as scikit-learn, xgboost and lightgbm. The package also includes a pre-processing module whose libraries can use compute resources provided by Snowflake's Snowpark-Optimised Warehouses, allowing scalable data transformations.

What Is a Snowpark-Optimised Warehouse?

This type of warehouse provides 16x the memory per node compared to standard Snowflake virtual warehouses. It is recommended for workloads with large memory requirements, such as ML training use cases.

To create one, you need to specify the WAREHOUSE_TYPE attribute. The minimum size supported for these warehouses is MEDIUM:

CREATE OR REPLACE WAREHOUSE snowpark_opt_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';

To learn more about these warehouses and look at credit consumption, you can check out this article.

2. Snowpark ML OPS (Currently in Private Preview)

This series of components provides model management capabilities and integrated deployment in Snowflake. It includes a FileSet API that provides fsspec objects, moving machine learning data seamlessly from a Snowflake table to a stage and then feeding it to PyTorch or TensorFlow. It also includes a model registry, a Python API to manage models within Snowflake and deploy them as vectorized user-defined functions.

What Is a fsspec Object?

Fsspec stands for filesystem specification; the fsspec project for Python aims to provide a unified, pythonic interface to local, remote and embedded file systems and byte storage. The need for this arises because there are many ways to store bytes: in memory, on local disk, in cluster-distributed storage or in the cloud. To add complexity, many files also contain internal mappings of names to bytes, for example in a hierarchical, directory-oriented tree. Managing all this complexity can become quite cumbersome, so the fsspec project exists to provide a familiar API that works the same way whatever the storage backend. You can read more about this project here.
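
As a quick illustration of the idea (not Snowpark-specific; the file paths here are placeholders), the same fsspec call pattern works regardless of where the bytes live:

import fsspec

# Write to the local filesystem through fsspec's unified interface
with fsspec.open("example.txt", "w") as f:
    f.write("hello from fsspec")

# The exact same pattern would work against a remote backend, e.g. S3;
# only the URL scheme changes (bucket and key are placeholders):
# with fsspec.open("s3://my-bucket/example.txt", "r") as f:
#     print(f.read())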

I decided to try out Snowpark ML's development functionality using a familiar and easily accessible dataset of Scooby Doo episodes from Kaggle. This is a great dataset for testing several techniques, as it contains all kinds of features to play with: categorical, numerical and boolean.

The architecture of this solution involves the client-side libraries (Snowpark and Snowpark ML) and the server-side runtime, since we deploy the model as a user-defined function.

Above: Solution architecture. Copyright 2023 Snowflake Inc. All Rights Reserved.

Are You Ready to Dive in?

To deeply understand the code, there is nothing better than playing with it, modifying some stuff and trying your own ideas. So, if you want to follow along, you can download the project code from the GitHub repo.

Let’s overview the steps that I completed for this project:

1. Set Up Python Environment

You have several ways to set up your Python environment; for a full explanation of these options, please visit our "Definitive Guide to Snowflake Sessions with Snowpark for Python."

If you are familiar with conda environments and are ready to dive in, you can create your conda environment using the conda_env.yml file, which lists all the necessary packages:

name: snowpark-ml-scooby
channels:
  - https://repo.anaconda.com/pkgs/snowflake
dependencies:
  - python==3.9
  - snowflake-snowpark-python==1.6.1
  - snowflake-ml-python==1.0.2
  - pandas==1.5.3
  - xgboost==1.7.3
  - notebook==6.5.4
  - scikit-learn==1.2.2
  - cachetools==4.2.2
  - seaborn==0.12.2
  - matplotlib==3.7.1

If you already have conda installed, you can create the environment using:

conda env create -f conda_env.yml

2. Set Up Snowflake Environment

In this SQL script, I create all the necessary objects in Snowflake: databases, schemas, stages, roles, etc.

/*-----------------------------------------------------------------------------
Blogpost code: Predicting IMDB with Snowpark ML
Script:       01_setup_snowflake.sql
Author:       Azucena Coronel
Last Updated: 2023/08/22
-----------------------------------------------------------------------------*/


-- ----------------------------------------------------------------------------
-- Step #1: Create the account level objects
-- ----------------------------------------------------------------------------
---Database Objects
USE ROLE SYSADMIN;
CREATE OR REPLACE WAREHOUSE "LAB001_WH" 
  WITH WAREHOUSE_SIZE = 'X-SMALL' 
  AUTO_SUSPEND = 60 
  AUTO_RESUME = TRUE 
  MIN_CLUSTER_COUNT = 1 
  MAX_CLUSTER_COUNT = 1 
  STATEMENT_TIMEOUT_IN_SECONDS = 60
  SCALING_POLICY = 'STANDARD'
  INITIALLY_SUSPENDED = TRUE;

CREATE OR REPLACE DATABASE LAB001_DB;
CREATE OR REPLACE SCHEMA SCOOBY_SCHEMA;
CREATE OR REPLACE STAGE SCOOBY_ASSETS; --to store model assets


-- ----------------------------------------------------------------------------
-- Step #2: Create role for the project and assign the objects
-- ----------------------------------------------------------------------------
USE ROLE SECURITYADMIN;
-- Roles
SET MY_USER = CURRENT_USER();
CREATE OR REPLACE ROLE LAB001_ROLE;
GRANT ROLE LAB001_ROLE TO ROLE SYSADMIN;
GRANT ALL ON WAREHOUSE LAB001_WH TO ROLE LAB001_ROLE;

USE ROLE ACCOUNTADMIN;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE LAB001_ROLE;
GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE LAB001_ROLE;

USE ROLE SECURITYADMIN;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE LAB001_ROLE;

--Database
GRANT USAGE ON DATABASE LAB001_DB TO ROLE LAB001_ROLE;
--Schema
GRANT USAGE, MONITOR, CREATE TABLE, CREATE FUNCTION, CREATE PROCEDURE ON ALL SCHEMAS IN DATABASE LAB001_DB TO ROLE LAB001_ROLE;
GRANT USAGE, MONITOR, CREATE TABLE, CREATE FUNCTION, CREATE PROCEDURE ON FUTURE SCHEMAS IN DATABASE LAB001_DB TO ROLE LAB001_ROLE;
--Tables
GRANT ALL ON ALL TABLES IN DATABASE LAB001_DB TO ROLE LAB001_ROLE;
GRANT ALL ON FUTURE TABLES IN DATABASE LAB001_DB TO ROLE LAB001_ROLE;
--File Formats
GRANT ALL ON STAGE LAB001_DB.SCOOBY_SCHEMA.SCOOBY_ASSETS TO ROLE LAB001_ROLE;
GRANT ALL ON FILE FORMAT LAB001_DB.SCOOBY_SCHEMA.CSVFORMAT  TO ROLE LAB001_ROLE;

-- Grant role to user
GRANT ROLE LAB001_ROLE TO USER IDENTIFIER($MY_USER);

3. Snowpark ML Data Ingest

  1. Stage. In this step, I first create the Snowpark session; for easier access later on, I save the credentials in a creds.json file in my project (a minimal sketch of this setup appears after this list). Remember that this is a shortcut only for this pet project on a trial account; credentials saved as plain text are not meant for serious use cases!
  2. Read into dataframe. Next, it was time to ingest my data into Snowflake from a local CSV file. I first staged it with Snowpark's session.file.put into the stage we created in the first step. Then, I defined a StructType object with the schema so I could read from the stage into a Snowpark dataframe with session.read.options. Finally, I cast the data types, as everything was originally ingested as String; I decided to use the TRY_CAST function. It was straightforward to create quick for loops and use Snowpark functions to cast columns to the various data types:
    int_array = ["ENGAGEMENT","RUN_TIME","MONSTER_AMOUNT",
                 "SUSPECTS_AMOUNT","CULPRIT_AMOUNT","SPLIT_UP",
                 "ANOTHER_MYSTERY","SET_A_TRAP","JEEPERS","JINKIES",
                 "MY_GLASSES","JUST_ABOUT_WRAPPED_UP","ZOINKS",
                 "GROOVY","SCOOBY_DOO_WHERE_ARE_YOU",
                 "ROOBY_ROOBY_ROO"]
    
    for col in int_array :
        scooby_df = scooby_df.with_column(
          col, F.try_cast(col,IntegerType()))
  3. Explore and clean data. The data is ingested, and it is time to start the famous EDA (exploratory data analysis). This step involves diving deep into the data to understand what we have. It's where we discover whether there are any values that require cleaning before we can dive into the machine learning fun. With the Scooby Doo dataset, there are several categorical values that definitely need cleaning. For example, a few episodes have more than one monster; hence we had arrays of values in some of the features, like MONSTER_TYPE. For example: Super-Villain, Sea-Monster, Possessed Object, Super-Villain. In order to simplify the data and be able to better encode it for machine learning, I did some data cleaning that transformed the previous example into: POSSESSEDOBJECT_SEAMONSTER_SUPERVILLAIN. Remember that Snowpark's dataframe behaviour is similar to pandas' dataframes, and something I really like about them is the ability to chain method calls when I need to apply several transformations. This led me to the following statement:
    def unique_array(column):
        ar = F.array_to_string(
            F.array_sort(F.array_distinct(F.strtok_to_array(
                F.regexp_replace(F.rtrim(F.upper(F.col(column)),F.lit(",")), '[^a-zA-Z0-9|,]+', '')
                ,F.lit(","))))
            ,F.lit(","))
        return F.iff(F.col(column) != 'NULL', ar , None)

    This basically takes the value of the column, applies an upper case to it, right trims it to get rid of a trailing "," and then applies a regexp_replace to get rid of characters that are not letters, numbers or commas. This is all done in that order in line 4 of the code above. On top of that, we apply strtok_to_array to convert the comma-separated string into an array. We can then use array_distinct to get rid of duplicated values and sort them. Finally, we convert this clean array back into a string separated by commas.

    OK, maybe this is a bit too much (just maybe), but it demonstrates the capability and usefulness of chaining method calls.

  4. Write to Snowflake. Once we are happy with the data cleaning, it is time to write the data to a Snowflake table and get ready for the next step, where the ML fun begins.
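
For reference, here is a minimal sketch of what steps 1 and 2 look like in code. The creds.json keys, the file name and the truncated schema are assumptions for illustration only; the notebook in the repo defines the full StructType for the dataset.

import json

from snowflake.snowpark import Session
from snowflake.snowpark.types import StructType, StructField, StringType

# Load the connection parameters saved in creds.json (plain-text credentials:
# a trial-account shortcut only, not for real projects)
with open("creds.json") as f:
    connection_parameters = json.load(f)  # e.g. account, user, password, role, warehouse, database, schema

session = Session.builder.configs(connection_parameters).create()

# Stage the local CSV file into the stage created by the setup script
session.file.put("scooby_episodes.csv", "@SCOOBY_ASSETS", auto_compress=True, overwrite=True)

# A truncated schema for illustration; every column is read as String and cast later
scooby_schema = StructType([
    StructField(col_name, StringType())
    for col_name in ["SERIES_NAME", "TITLE", "IMDB", "ENGAGEMENT", "RUN_TIME"]
])

scooby_df = (
    session.read
    .schema(scooby_schema)
    .options({"FIELD_DELIMITER": ",", "SKIP_HEADER": 1, "FIELD_OPTIONALLY_ENCLOSED_BY": '"'})
    .csv("@SCOOBY_ASSETS/scooby_episodes.csv.gz")
)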

Snowpark ML Feature Transformations

We have done our exploratory data analysis, identified what needed cleaning and prepared our data. It is time to start experimenting with Snowpark ML's feature transformation capabilities and set everything up for training our model:

  1. Data loading. First of all, we create our Snowpark session. Remember that creds.json file we hack-created above? Time to use it and avoid retyping our account, user and password information. We read from the table in Snowflake and get our Snowpark dataframe ready again.
  2. Feature transformations. It is time to test the first Snowpark ML module that will help us with the transformations: snowflake.ml.modeling.preprocessing. Here I tried mainly two methods:
    1. For the categorical values I used OneHotEncoder, which transforms each categorical feature into a one-hot numeric array. For example, for a category such as FORMAT that might have three values (TV_SERIES, MOVIE, CROSSOVER), the OneHotEncoder will create three features with a boolean value of 1 if the episode is that format or 0 if it is not.
    2. For the numerical values I used MinMaxScaler, which scales each feature to a given range (in this case between 0 and 1). By using this scaling method, we ensure that the values are within a fixed range and contribute equally to the analysis. We also prevent the numerical instability that can occur when dealing with extremely small or large numbers, and it helps with the performance optimisation of the algorithm.
  3. Preprocessing pipeline. Once I identified the features I wanted to use and the preprocessing techniques to apply to each of them, it was time to build the preprocessing pipeline using Snowpark ML's Pipeline object from snowflake.ml.modeling.pipeline. We will use this pipeline later on when we build our ML model, and it can also be pushed to the stage for reuse, as sketched after this list:
    CATEGORICAL_COLUMNS = ["FORMAT","NETWORK","SETTING_TERRAIN",
                           "MOTIVE","MONSTER_GENDER",
                           "CULPRIT_GENDER"]
    CATEGORICAL_COLUMNS_OE = ["FORMAT_OHE","NETWORK_OHE",
                              "SETTING_TERRAIN_OHE","MOTIVE_OHE",
                              "MONSTER_GENDER_OHE",
                              "CULPRIT_GENDER_OHE"] 
    
    NUMERICAL_COLUMNS = ["ENGAGEMENT","RUN_TIME","ZOINKS","GROOVY",
                         "SCOOBY_DOO_WHERE_ARE_YOU",
                         "ROOBY_ROOBY_ROO"]
    NUMERICAL_COLUMNS_NORM = ["ENGAGEMENT_NORM","RUN_TIME_NORM",
                              "ZOINKS_NORM","GROOVY_NORM",
                              "SCOOBY_DOO_WHERE_ARE_YOU_NORM",
                              "ROOBY_ROOBY_ROO_NORM"]
    
    # Build the pipeline
    preprocessing_pipeline = Pipeline(
        steps=[
                (
                    "OHE",
                    snowml.OneHotEncoder(
                        input_cols=CATEGORICAL_COLUMNS,
                        output_cols=CATEGORICAL_COLUMNS_OE
                    )
                ),
                (
                    "MMS",
                    snowml.MinMaxScaler(
                        clip=True,
                        input_cols=NUMERICAL_COLUMNS,
                        output_cols=NUMERICAL_COLUMNS_NORM,
                    )
                )
        ]
    )
    
    PIPELINE_FILE = 'preprocessing_pipeline.joblib'
    joblib.dump(preprocessing_pipeline, PIPELINE_FILE)
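
As an optional follow-up (my own addition, not part of the original notebook), the serialised pipeline can also be pushed to the SCOOBY_ASSETS stage so the training notebook can pull it down, rather than relying on a shared local file:

import joblib

# Upload the joblib file to the stage (no compression, to keep the name predictable)
session.file.put(PIPELINE_FILE, "@SCOOBY_ASSETS", auto_compress=False, overwrite=True)

# Later, in the training session, download and reload it
session.file.get("@SCOOBY_ASSETS/preprocessing_pipeline.joblib", ".")
preprocessing_pipeline = joblib.load("preprocessing_pipeline.joblib")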

Snowpark ML Model Training

In this last step of the project, we train our model and identify the hyperparameters that deliver the best model performance. Once we are happy with the model, we deploy it as a Snowflake user-defined function (UDF) so it is readily available for predictions.

  1. Data Loading. Similar to the previous steps, we use the creds.json file to start our session and read the cleaned data from Snowflake into a Snowpark dataframe. We also load the preprocessing pipeline object that we created in the last step and create arrays with the names of the features we are going to use in the model. Remember that in our preprocessing pipeline we give different treatment to the different data types: categorical features get one-hot encoding, numerical features get min-max scaling. For this reason, we create two arrays per data type to hold the input and output features for each step of the preprocessing pipeline. We also identify our label feature and the output column:
    CATEGORICAL_COLUMNS = ["FORMAT","NETWORK","SETTING_TERRAIN",
                           "MOTIVE","MONSTER_GENDER",
                           "CULPRIT_GENDER"]
    CATEGORICAL_COLUMNS_OE = ["FORMAT_OHE","NETWORK_OHE",
                              "SETTING_TERRAIN_OHE","MOTIVE_OHE",
                              "MONSTER_GENDER_OHE",
                              "CULPRIT_GENDER_OHE"] 
    
    NUMERICAL_COLUMNS = ["ENGAGEMENT","RUN_TIME","ZOINKS","GROOVY",
                         "SCOOBY_DOO_WHERE_ARE_YOU","ROOBY_ROOBY_ROO"]
    NUMERICAL_COLUMNS_NORM = ["ENGAGEMENT_NORM","RUN_TIME_NORM",
                              "ZOINKS_NORM","GROOVY_NORM",
                              "SCOOBY_DOO_WHERE_ARE_YOU_NORM",
                              "ROOBY_ROOBY_ROO_NORM"]
    
    LABEL_COLUMNS = ['IMDB']
    OUTPUT_COLUMNS = ['IMDB_PRICE']
    
    scooby_ml_df = scooby_df.select(LABEL_COLUMNS + CATEGORICAL_COLUMNS + NUMERICAL_COLUMNS)
    scooby_ml_df.show()
  2. Build Base Model. This step is where we create the baseline model. We are going to use an XGBoost regression model. XGBoost is an algorithm that uses gradient boosted decision trees; the name stands for eXtreme Gradient Boosting. It uses an ensemble learning method, which basically means it runs an "ensemble," or group, of decision trees and combines their results. There are two main ensemble approaches: bagging (which runs the decision trees in parallel and averages the results for the final prediction) and boosting (which runs them sequentially, where each decision tree builds on the output of the previous one and the algorithm aims to reduce the errors in each iteration). The first task is to divide our dataset into train and test datasets; in this case, we take 90% of the rows for training and the remainder for testing:
    # Split the data into train and test sets
    scooby_train_df, scooby_test_df = scooby_ml_df.random_split(weights=[0.9, 0.1], seed=0)
    
    # Run the train and test sets through the Pipeline object we defined earlier
    train_df = preprocessing_pipeline.fit(scooby_train_df).transform(scooby_train_df)
    test_df = preprocessing_pipeline.transform(scooby_test_df)

    Then we prepare the column names of the features we are going to use in the regressor: one-hot encoded for the categorical variables and normalised for the numerical ones. We build the regressor using XGBRegressor from snowflake.ml.modeling.xgboost:

    # Obtain the column names of the dataset for the Regressor; we exclude the None values
    CAT_COLS = [k for k in train_df.columns if ('_OHE' in k) & ('None' not in k)]
    NUM_COLS = [k for k in train_df.columns if ('_NORM' in k) & ('None' not in k)]
    
    # Define the XGBRegressor
    regressor = XGBRegressor(
        input_cols=CAT_COLS + NUM_COLS,
        label_cols=LABEL_COLUMNS,
        output_cols=OUTPUT_COLUMNS
    )
    
    # Train
    regressor.fit(train_df)
    
    # Predict
    result = regressor.predict(test_df)

    We can analyse the result using the Snowpark ML mean_absolute_percentage_error from snowflake.ml.modeling.metrics:

    # We can analyse the results using Snowpark ML's MAPE
    mape = mean_absolute_percentage_error(df=result, 
                                            y_true_col_names="IMDB", 
                                            y_pred_col_names="IMDB_PRICE")
    
    result.select("IMDB", "IMDB_PRICE").show()
    print(f"Mean absolute percentage error: {mape}")
  3. Find optimal model parameters. Once we have built the initial base model and obtained a metric to measure its performance (in this case the mean absolute percentage error), it is time to find the model parameters that give us the best performance. For this, we use the native Snowpark ML GridSearchCV from snowflake.ml.modeling.model_selection:

    grid_search = GridSearchCV(
        estimator=XGBRegressor(),
        param_grid={
            "n_estimators":[100, 200, 300, 400, 500],
            "learning_rate":[0.1, 0.2, 0.3, 0.4, 0.5],
        },
        n_jobs = -1,
        scoring="neg_mean_absolute_percentage_error",
        input_cols= CAT_COLS + NUM_COLS,
        label_cols=LABEL_COLUMNS,
        output_cols=OUTPUT_COLUMNS
    )
    
    # Train
    grid_search.fit(train_df)
    
    # Use to_sklearn to access the estimator object 
    grid_search.to_sklearn().best_estimator_

    We found that the best estimator uses n_estimators = 100 and learning_rate = 0.1. The previous MAPE was 0.05170313160722504; with the best estimator it is 0.050110345648975284.

  4. Deploy the model using a vectorised UDF. Snowpark ML's model.predict() function creates a temporary UDF. In order to persist it and make it available within Snowflake, we need to define our own UDF. Once Snowpark ML's native model registry is available, that will be the most streamlined approach to deploying the model; for now, we need to save the model as a scikit-learn object so it can be used externally. First, we save our model and gather all the feature names:
    # Let's save our optimal model first
    optimal_model = grid_search.to_sklearn()
    MODEL_FILE = 'model.joblib'
    joblib.dump(optimal_model, MODEL_FILE) # we are just pickling it locally first
    
    # You can also save the pickled object into the stage we created earlier
    session.file.put(MODEL_FILE, "@SCOOBY_ASSETS", overwrite=True)
    
    # Get all relevant column names to pass into the UDF call
    feature_cols = test_df[CAT_COLS + NUM_COLS].columns

    Then we create the vectorised User Defined Function (UDF):

    # Cache the model load to optimize inference
    @cachetools.cached(cache={})
    def load_model(filename):
        import joblib
        import sys
        import os
    
        IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
        import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]
    
        if import_dir:
            with open(os.path.join(import_dir, filename), 'rb') as file:
                m = joblib.load(file)
                return m
    
    # Register the UDF via decorator
    @udf(name='batch_predict_imdb', 
         session=session, 
         replace=True, 
         is_permanent=True, 
         stage_location='@SCOOBY_ASSETS',
         input_types=[F.FloatType()]*len(feature_cols),
         return_type=F.FloatType(),
         imports=['@SCOOBY_ASSETS/model.joblib.gz'],
         packages=['pandas','joblib','cachetools','xgboost'])
    def batch_predict_imdb(test_df: pd.DataFrame) -> pd.Series:
        # Need to name the columns because column names aren't passed in to this function
        test_df.columns = CAT_COLS + NUM_COLS
        model = load_model('model.joblib.gz')
        return model.predict(test_df) # This is using the XGBoost library's model.predict(), not Snowpark ML's

    Finally, we can call our UDF to predict the IMDB score for our test data.

    test_df_w_preds = test_df.with_column('PREDICTED_IMDB', batch_predict_imdb(*feature_cols))
    test_df_w_preds.show()

    The model is now deployed in Snowflake, ready for use! It can also be called from any Snowpark session, as sketched below.
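
As a rough sketch of that reuse (my own illustration, which assumes the fitted preprocessing pipeline and the feature_cols list from the previous steps are available), any Snowpark session can now score data by calling the permanent UDF:

from snowflake.snowpark import functions as F

# Re-apply the saved preprocessing pipeline to the full cleaned dataset and
# call the permanent UDF registered above to score every episode
scored_df = preprocessing_pipeline.transform(scooby_ml_df).with_column(
    "PREDICTED_IMDB",
    F.call_udf("batch_predict_imdb", *[F.col(c) for c in feature_cols])
)

scored_df.select("IMDB", "PREDICTED_IMDB").show()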

Closing

In a few clear steps, we went from data in a .csv file to predictions, all using the power of Snowflake. I really enjoyed working with familiar structures such as Snowpark dataframes and scikit-learn-style libraries for data transformation and model building. I was pleasantly surprised by the neat way of putting together the preprocessing pipeline: identify your features and the treatment you are going to give them, and build the pipeline! In regards to deployment, at the moment we needed to go through the step of converting to a scikit-learn object, but, as mentioned, the Snowflake model registry will make things easier. Looking forward to it!

I hope that you find this walkthrough useful and that you are ready to test this workflow on your own data. Nothing beats testing new features with your own projects and seeing how you can best harness the power of Snowflake. Happy experimenting!

Sentiment Analysis With Native Algorithms in Dataiku

https://interworks.com/blog/2023/08/03/sentiment-analysis-with-native-algorithms-in-dataiku/ | Thu, 03 Aug 2023

This is the third part of a short series called “Natural Language Processing with Dataiku,” where we are analysing airline reviews. In the first part, we did very basic data preparation, used the Dataiku sentiment analysis plugin and evaluated the model with the help of a confusion matrix. In the second part, we applied some NLP pre-processing tasks and ran the Dataiku sentiment analysis again. Upon evaluating once more, we were able to see how the performance increased. In this third section, we will use the native Dataiku machine learning (ML) algorithms to predict polarity based on text.

Data Preparation

To start with, we will divide our prepared dataset into two datasets for training and testing. If you were following the previous two blog posts, you will have seen that we applied some text pre-processing techniques to the column "customer_review." For the purpose of this exercise, I have slightly modified the pipeline to keep "customer_review" with the raw data and created a new field, "customer_review_simplified," with the pre-processed text. To easily achieve the split, we will use a split recipe to randomly split the dataset. As we are using a random way to dispatch the data, it is very important to select "Set a random seed" so that the split is always done the same way and the results are reproducible:

To make sure that the train and test datasets remain balanced (having the same percentage of positive and negative reviews), you can use the "Analyze" capability of Dataiku. To do this, double click on the train dataset, scroll to the "recommended_label" field and click on "Analyze" like you see here:

Select “Whole Data” and click “Compute.” We can see that the datasets are not too unbalanced:

Creating a Baseline Model

The first model that we are going to create will use the “customer_review” field (without any preprocessing) to predict “recommended_label.” Dataiku has powerful algorithms within its AutoML functionality. To access this, select the “train dataset/lab/AutoML Prediction” to create a prediction model on “recommended_label” like you see here:

Dataiku rejects text features by default, so we need to manually go into the Design tab and reject all features except "customer_review." As this is the first experiment, we will leave all the defaults, including:

  • Text handling = Term hashing + SVD
  • Algorithms to test = Random Forest and Logistic Regression
  • Hyperparameters = default

Click “Save,” then click “Train.” We will name this “training session = baseline” like you see here:

In this case, optimising for the metric “ROC AUC,” the Logistic Regression performed better than the Random Forest:

In the previous blogs of this NLP with Dataiku series, we tested the predictions and produced a confusion matrix to visually see the percentages of correct predictions. In this exercise, we will use the Evaluation Store and focus on the metric we are optimising for, "ROC AUC," and also on the "Cost Matrix Gain." We can still see each of the confusion matrices if we need to by double clicking on each of the individual evaluations.

Deploying the Model and Testing

The next step is to deploy our Logistic Regression model in the pipeline to be able to evaluate it. To achieve this, click on the Logistic Regression model and hit "Deploy" in the top right:

This will deploy two objects to our Dataiku flow: the train recipe and the model, visualized here:

The next step is to use that model to score our test dataset. For this, click on the "airlines_review_test" dataset and then on the "Predict" recipe. Give a good name to this scored dataset; in my case, I used "airlines_review_test_scored_baseline_logreg." Click "Run" to predict:

Once this is predicted, we will use an “Evaluate” recipe and keep the results in an Evaluation Store to be able to easily compare the results:

Second Experiment With Pre-Processed Text

Our next experiment will use the same AutoML configuration as the baseline model, but with our "customer_review_simplified" feature. As a reminder, we used Dataiku's "Simplify Text" pre-processor to get the stem of the words and to remove the stop words. If you want to read more about why we would do this, please read the second part of this series.

Above: A view of the Simplify Text feature from the second blog post.

After retraining the model with this "customer_review_simplified" feature, we deployed it to the flow (updating the current model) and ran the full pipeline to be able to compare the results in the Evaluation Store. As we can see from the previous optimisation screenshot, the metric we are optimising for, ROC AUC, got slightly better. We are on a good track to optimise the performance of this polarity prediction.

You'll notice here there are two evaluations with this second model. This is due to us running it with "customer_review_simplified" ordered both alphabetically and not alphabetically, which didn't make a significant difference in performance:

Third Experiment: Adding New Features

In the spirit of science (hence the data "science" term), let's do another experiment. This time, we will add a few new features based on customer_review.

  • First, we will add two simple features: “length_raw” and “length_simplified,” which will count the number of terms in each of the fields:
  • Then, we will add the ratio between these two lengths and call the new feature "length_ratio," like you see here (a pandas sketch of these three features follows this list):
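
Outside Dataiku, the equivalent computation in pandas would look roughly like this (the column names follow the post; the tiny DataFrame and the direction of the ratio are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({
    "customer_review": ["The airline lost my luggage."],
    "customer_review_simplified": ["airlin lost luggag"],
})

# Count the number of terms in each field
df["length_raw"] = df["customer_review"].str.split().str.len()
df["length_simplified"] = df["customer_review_simplified"].str.split().str.len()

# Ratio between the two lengths
df["length_ratio"] = df["length_simplified"] / df["length_raw"]

print(df[["length_raw", "length_simplified", "length_ratio"]])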

Once these features are added in the initial data prep step, we’ll run the split recipe as well to surface these new features to “airlines_review_train” and “_test” datasets and use them in the next model training session. We call the next training session “extra_features” like so:

The Logistic Regression algorithm again gets the higher ROC AUC, so we deploy this new model to the flow, use it to score our test dataset and evaluate it. These new features didn't increase the ROC AUC, but they slightly increased the cost matrix gain:

So, we’ll take the increase and move on to our next and final experiment.

Fourth Experiment: Handling Text Features in the ML Model Design

We have added a few extra features in the previous steps. Now, we will experiment with the different techniques for handling text features in the ML model. For our first baseline, we left the default value, "Tokenization, hashing and SVD." This time, we will select "Count vectorization" in the Design tab:

Then we do our workflow one more time:

  • Train (name the session "Count_vectorization," do not drop and recreate the train and test datasets as the features haven't changed, and accept the warning for sparse features)
  • Select the model with highest performance
  • Deploy to our flow
  • Score the test dataset
  • Evaluate the results

It is possible that random forest can't run with this text handling technique due to memory issues. Count vectorization basically creates an occurrence matrix per term, which will be pretty large and sparse. Random forest algorithms are not great with these large, sparse matrices, as each tree can be really deep and have thousands of nodes, which makes the memory consumption of the random forest grow very fast:

On the good side, we can see that with this new text handling technique our ROC AUC increased. Once we deploy, score the test dataset and evaluate the performance, we can see the full details compared to the previous experiments in the Evaluation Store:

With a few experiments, we were able to increase the performance of our ML model. We experimented with new features and with different text handling techniques, but it doesn't need to stop there. Dataiku offers varied algorithms natively within the platform, and several hyperparameters to tune within each of them, so you can try different approaches. If this is not enough, Dataiku also allows your custom Python models to be imported to the platform and tested through the same workflow we demonstrated here. With Dataiku, we can do as much experimenting as we need, and it provides us with the framework to easily compare experiments and go back to earlier versions when we need to. Dataiku makes data science easier!

Are you excited to try out Dataiku? Contact us to chat about how we can empower you to start your data science journey today.

Create Real ROI with Data Science

https://interworks.com/blog/2023/07/25/create-real-roi-with-data-science/ | Tue, 25 Jul 2023

Our new Solutions Spotlight series will focus on client outcomes, use case solutions, impactful new technology and best practices across data and analytics. We’ll cover strategy, governance, user adoption, finding ROI quickly, architecture, agile development, tips and tricks, and more.

As part of this series, in our latest webinar we talked about creating real ROI with data science, and (spoiler alert) we do this by focusing on the business value that our initiatives are going to bring. Let's explore more!

Adding in the Data Science

We started by talking about the rate at which companies have adopted AI globally. In Australia, for example, 24% of companies have deployed AI and 44% are exploring it. This leaves 32% of companies that potentially haven't yet started exploring the benefits that AI and ML can bring to their organisation. There might be several reasons for this, such as companies not having the right skillset in house or having other pressing priorities to focus on. That said, it is important to understand the benefits that these technologies can bring and, moreover, to have a framework to deliver value to the business in an agile way.

This framework starts with being able to identify the different use cases and prioritise in terms of value to the business and effort of implementation. It is recommended to start with the highest value and lowest effort use cases to showcase early on the benefits of using ML:

Increasing ROI

As part of this framework to shorten time to value, we talked about one of the current challenges of organisations aspiring to do data science: every data analyst and data scientist might be using their own preferred tools and languages, which makes it difficult for the organisation to enable team collaboration and to have a sustainable way of productionising and monitoring the models later on. This is where our technology partner Dataiku comes in.

On one side, Dataiku enables people of all skill levels and working styles through its unique combination of visual recipes for the low-code/no-code audience, while catering to the people who prefer coding with its code recipes, notebooks and code environments. On the other side, Dataiku enables the full machine learning project lifecycle (from ideation and exploratory data analysis, to data preparation and model experimentation, to getting the selected models into production and monitoring them afterwards) all under one roof, which means streamlined processes that allow you to focus on more important things: delivering new use cases and capabilities to the business.

With the help of a framework, the right toolset and a partner that supports you from zero to ML excellence, you are on the right path to deliver real business value through data science:

Are you excited to explore more? Contact us to chat about how we can empower you to start your data science journey today.

Challenges of NLP and Solutions with Dataiku

https://interworks.com/blog/2023/04/25/challenges-of-nlp-and-solutions-with-dataiku/ | Tue, 25 Apr 2023

In the previous post (that you can read here), we started doing some analysis on airline reviews. We did very basic data preparation, used the Dataiku Sentiment Analysis plugin and evaluated the model with the help of a confusion matrix. In this blogpost, we will go one step further and apply some Natural Language Processing (NLP) pre-processing tasks, then we will use the Dataiku sentiment analysis plugin again and compare the results with the first experiment.

Challenges of NLP

Human language is unstructured and messy. Machine learning is based on trying to find patterns in the training data. The challenge of NLP is to turn raw text data into features that an ML algorithm can process and search for patterns. The most basic approach of turning this messy, unstructured data into features is to consider natural language as a collection of categorical features, in which each word is a category of its own. So, for example, these three sentences…

  1. “The airline lost my luggage.”
  2. “This is what happens when traveling with such low quality airlines.”
  3. “Airlines like this should be banned”

…will have 22 features:

Because of the unstructured and messy nature of human natural language, we face a few challenges, such as:

  1. Sparse features:
    You will notice that the features of these three sentences are very sparse. The words, as they are, only appear in one single sentence each. This sparsity will make it difficult for the algorithm to search for similarities between sentences and find patterns.
  2. Redundant features:
    The three sentences are talking about an airline/airlines, but given that two of those words are plural and one is capitalized, without some form of pre-processing, these are taken as three separate features.
  3. High dimensionality:
    These three short sentences generated 22 features. If we were to analyse a full paragraph, a full article or even a full book, you can just imagine how many hundreds or even millions of features we would end up with. This high dimensionality is a problem because the more features you have, the more storage and memory you need to process them. This is why, for text analysis, we would ideally apply some techniques to reduce the dimensionality.

Pre-Processing Steps To Deal With These Challenges

To deal with these three problems, there are three basic and very valuable data cleaning techniques for NLP:

1. Normalizing text

The objective of this technique is to transform the text into a standard format, which includes:

  • converting all characters to the same case
  • removing punctuation and special characters
  • removing diacritical accent marks

So, in the case of our three sentences, this step takes us from 22 features to 20:

2. Removing stop words

Stop words are a set of commonly used words in any language. In this step, we remove them to be able to focus on the important words that convey more meaning.

In our example, this step takes us from 20 features to 10.

3. Stemming

This step transforms each word into its "stem." So, for example, "airlines" and "airline" will become "airlin." Now, we go down to nine features.
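
To make the three steps concrete outside Dataiku, here is a rough Python sketch using NLTK's Snowball stemmer (the tiny stop word list is illustrative only; a real one would be much longer):

import re

from nltk.stem import SnowballStemmer

STOP_WORDS = {"the", "this", "is", "my", "when", "with", "such", "what", "like", "should", "be"}
stemmer = SnowballStemmer("english")

def preprocess(sentence):
    # 1. Normalise: lowercase and remove punctuation/special characters
    normalised = re.sub(r"[^a-z0-9\s]", "", sentence.lower())
    # 2. Remove stop words and 3. stem the remaining tokens
    return [stemmer.stem(token) for token in normalised.split() if token not in STOP_WORDS]

print(preprocess("The airline lost my luggage."))  # something like ['airlin', 'lost', 'luggag']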

Cleaning Our Airline Reviews

Let’s apply some of these techniques to our airline reviews, and then process them again with the Sentiment Analysis plugin to compare the results with the previous run.

  1. The first step is to analyse the data. We can use values clustering on "customer_review." Dataiku identifies a few clusters with very similar reviews:

  2. From the previous analysis, we can see that a few of the records have a first sentence stating whether the trip is verified or not. This does not add any value to the sentiment analysis, so the first pre-processing step is to get rid of that part of the first sentence. For this, we use a "Find and Replace" data processor to replace the strings:

✅ Trip Verified | → No Value
✅ Verified Review | → No Value
Not Verified | → No Value

Just for reviewing the recipe, I'm using an output column, but before running and saving I'll delete it so that the replacement is done in place. The rest of the flow uses the feature "customer_review."

3. The next step is to apply the Simplify text processor, which contains four different processes for simplification of text columns:

  • Normalize text: Transform to lowercase, remove punctuation and accents and perform Unicode NFD normalization (like Café -> cafe).
  • Stem words: Transform each word into its “stem,” i.e. its grammatical root. For example, “grammatical” is transformed to “grammat.” This transformation is language-specific.
  • Clear stop words: Remove so-called “stop words” (the, I, a, of, …). This transformation is language-specific.
  • Sort words alphabetically: Sorts all words of the text. For example, “the small dog” is transformed to “dog small the,” allowing strings containing the same words in different order to be matched.

Once more, I used the output column as “customer_review_simplified” for reviewing, but before running the recipe, I will delete it so that the simplification is done on the same column, “customer_review.”

Run the ML Model With the Clean and Simplified Reviews

Once we have finished our pre-processing, let's run the model again using the Dataiku Sentiment Analysis plugin and compare the results. One of my favourite components of Dataiku is the Model Evaluation Store. This component saves the evaluation metrics every time it runs, so it is very easy to see previous runs and compare performance. Let's look at the confusion matrix metrics from the previous blog post and the current run after we cleaned and simplified the text:

Above: Previous run (2023-03-11 00:44:53) vs. current run with cleaned and simplified customer reviews (2023-03-22 19:30:57).

As you can see above, we did increase all our metrics, and we went into positive territory on the average gain per record. This pre-processing for natural language processing was well worth it. Part of data science is about experimenting, and while we did experiment by doing pre-processing, there are still more ideas to explore! Have a look inside the Dataiku Sentiment Analysis plugin, use other algorithms and apply other pre-processing techniques. With Dataiku, we can do as much experimenting as we need, and it provides us with the framework to easily compare experiments and go back to earlier versions when we need to. Dataiku makes data science easier!

Are you excited to try out Dataiku? Contact us to chat about how we can empower you to start your data science journey today.

Sentiment Analysis with Dataiku

https://interworks.com/blog/2023/03/24/sentiment-analysis-with-dataiku/ | Fri, 24 Mar 2023

This is the first in a blog series aimed at highlighting the use of natural language processing in Dataiku. Stay tuned to the blog for more installments coming soon!

The Trip

I recently travelled from Sydney to Mexico to visit my family, and I had my last flight leg (Dallas -> Mexico) cancelled due to bad weather. As the frustration of all those affected by this overflowed and people complained loudly to the customer service members, I wondered what kind of reviews this airline was going to have with so many angry customers. This wondering evolved into me actually wanting to analyse the data, and see what I could find.

One of my favourite data science areas is natural language processing (NLP). To me, making a computer understand natural language is of utter importance given all the applications down the line. We have things like sentiment analysis, topic analysis, chatbots and more, with different applications from business to social good. I've heard of a few chatbots for social good trialled in community engagement, public service FAQs and even personalised education. Recently, I read about a project where topic analysis is being applied to medical research papers in order to help categorise them in a more efficient and reliable way. NLP helps add useful numeric structure to text data and resolve ambiguity in language, among many other applications that make better use of our text data around the world.

With the inspiration that natural language processing brings, and the idea of analysing the reviews of different airlines around the world, I set out to find a dataset that could give me something to try, and of course Kaggle didn't disappoint. The Skytrax Airline Reviews dataset is a great one to get started with some NLP analysis.

The pipeline for this project ended up looking like this:

Data Ingestion

Dataiku pleasantly surprised me. I initially investigated the dataset with external Python scripts, and half of the records were empty. With Python, I cleaned it easily enough, but with Dataiku, when I grabbed the .xlsx file and uploaded it, I literally didn't have to do anything: it cleaned these empty records automatically!

Above: Some of the cleaned records.

Initial Data Prep

Before using the evaluation recipe, there is some data clean up that I need to do. I had to clean up the recommended label and the review date:

I created a feature, recommended_label, and mapped yes = positive and no = negative:

I might want to do some analysis with the customer review date later, so I first cleaned it using a regex (to get rid of the letters from the day, e.g. 3rd, 4th, 21st). Then, upon analysing, I discovered a few typos with "Augu" (instead of August), so I cleaned those as well. Finally, I used the Parse Date recipe to get the date parsed:

Above: Before.

Above: And after.

Using the Sentiment Analysis Plugin and Preparing to Evaluate Its Results

One of the advantages of Dataiku is the ability to extend its power with the use of plugins. A plugin is a package of components that can be reused and even shared with others. Dataiku has a growing number of plugins that can get you started with different types of analysis, such as geospatial, NLP, deep learning and computer vision, amongst many more. You can find the current list of available plugins here.

In this case, I wanted to do a sentiment analysis on the customer reviews, and I found a great plugin to try out called, fittingly, "Sentiment Analysis."

I used the recipe on the customer reviews with a binary output (positive/negative) and prediction confidence scores. Upon skimming through some records, the predictions appeared doubtful: I got records labelled as positive where the overall recommendation was "no." I wanted to investigate further and put some statistics around this. I was interested in visualising a confusion matrix using the recommended feature as a proxy for evaluation, while acknowledging that a more accurate approach would be a human-labelled dataset with "positive/negative" noted per record.

Above: What I settled on for this section.

In order to perform the evaluation, I needed to do a few more data prep steps after the prediction. There were a few records that did not have a value in the "recommended" feature, so I filtered those out, as we will not be using them for our model evaluation:

The evaluation recipe utilises the NumPy function np.isnan() to detect whether there are empty values in the records used to evaluate. As our current classes have the string values "positive" and "negative," I used an extra prep step to map these to "1" and "0" in order for the evaluation step to work correctly. I also made sure that the storage type of both columns was integer, as I got a Python error when I accidentally left it as a string:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced

The evaluation recipe in Dataiku takes an evaluation dataset and a model. But, because a plugin is considered an "External Model," I needed to use a SER (Standalone Evaluation Recipe). This recipe uses the evaluation dataset as input, and its output is an Evaluation Store. Every time the evaluation runs, a new Model Evaluation is added to the Evaluation Store. The evaluation dataset needs to contain the predicted value. Most of the time, a field with the "ground truth" will be necessary to compare against and get some model performance metrics, but even if there is no "ground truth" field, it is still possible to utilise the SER component, as some of the analysis (like drift and prediction drift) remains possible. I left the sampling and evaluation naming at their defaults.

I configured the model part as follows:

  • prediction type = two-class classification as we are predicting positive/negative
  • prediction column = empty; as it is a classification problem, we use the Probabilities section to specify the different classes being predicted
  • labels column = recommended_label_n
  • Probabilities = 1 and 0; both in the column predicted_sentiment_n

Above: The configured parts

For the cost matrix, I assigned the same weights for true and false, as I'm giving the same priority to both predictions. In some other use cases, it is possible to play with these values if the business penalises false negatives or false positives more heavily. Speaking of false negatives and false positives, this is one of my favourite stats and data science memes of all time, as it clearly explains these concepts:

For example, in a customer churn model, I could assign a higher cost to predicting a customer as not churning when they effectively churn (a Type II error, or false negative). So I could set this "predicted false, but actually true" case as a gain of -5, considering that once a client has churned it is very difficult to win them back.

Analysing the Model Evaluation

The first time it ran, the threshold of 0 was found by optimising the F1-score, which is the harmonic mean of precision and recall.

As a side note:

  • Precision: Proportion of correct (“positive“) predictions among “positive” predictions.
  • Recall: Proportion of (correct) “positive“ predictions among “positive” records.

For those who want to dive a bit deeper into the stats behind this, here is a fantastic resource that discusses the evaluation metrics in Dataiku.
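
As a quick, made-up illustration of the definitions above (the labels below are arbitrary, with 1 = positive and 0 = negative):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

print(precision_score(y_true, y_pred))  # correct positive predictions / all positive predictions = 3/4
print(recall_score(y_true, y_pred))     # correct positive predictions / all actual positives = 3/5
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall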

Looking at the confusion matrix for this first run, it didn't look overly encouraging: the F1-score, precision and recall hit only ~19%, and the accuracy only 15%.

Our dataset is fairly balanced, with ~51% positive reviews and 49% negative ones, so in this case we can count on accuracy as a fair metric to evaluate this model. In highly unbalanced datasets, accuracy is not a great metric. For example, in a dataset where 90% of reviews are positive, the accuracy of always predicting positive might be 90% or more, but that would not automatically be a great model, because it would almost never predict the negative class.

Above: The Confusion Matrix

Although the predictions didn't really hit the mark on this first go, it inspired me with several ideas that I want to investigate. On one hand, I'd like to investigate what the plugin is doing and review whether there are a few parameters I could try out to see if I can increase the accuracy. On the other hand, I really want to investigate the customer reviews with a few other techniques. One NLP technique that I have applied before is called "bag of words." The bottom line is, you start by extracting features, or words, from the text, dismissing all grammatical structure and word order; the idea is just to obtain the words and their counts within the text. These are then compared with two baseline word lists (or bags of words), one containing only words associated with positive sentiment and the other containing words associated with negative sentiment. This gives the customer review a score based on how many positive or negative occurrences the text has.
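
A toy sketch of that bag-of-words scoring idea (the word lists are placeholders; real sentiment lexicons are far larger):

POSITIVE_WORDS = {"great", "comfortable", "friendly", "helpful", "excellent"}
NEGATIVE_WORDS = {"lost", "cancelled", "delayed", "rude", "terrible"}

def bag_of_words_score(review):
    # Count how many tokens appear in each lexicon, ignoring grammar and word order
    tokens = review.lower().split()
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return positives - negatives  # > 0 leans positive, < 0 leans negative

print(bag_of_words_score("the crew was friendly but my luggage was lost"))  # 0, i.e. mixed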

Natural language processing is quite a difficult task, but also a very exciting one with lots of possibilities. Part of data science is about experimenting with the data and the algorithms, and searching for ways to get better predictions. Dataiku offers a great framework and tool to make this experimentation easier! Once I'm ready to start another experiment, or change features or parameters in my pipeline, I will run it with the new configuration, evaluate it and be able to easily come back and compare it with this first one.

Are you excited to try out Dataiku? Contact us to chat about how we can empower you to start your data science journey today.

DataWomen Event Recap: IEEE Mixer

https://interworks.com/blog/2022/11/07/datawomen-event-recap-ieee-mixer/ | Mon, 07 Nov 2022

On Friday 24 October, we were invited to an IEEE NSW Education Chapter and DataWomen university mixer where we had the privilege to talk to information technology students of all levels about all the different roles they can pursue in the IT field. We paired that with examples of the roles we have here at InterWorks, the different activities we perform in our day to day and how our diverse toolset fits together in the current modern data cloud.


Above: Carol Prins, Dr. Geng Sun, and me as presenters for the mixer.

How We Talked the Talk

We started off by sharing who we are as both DataWomen and InterWorkers. We were able to chat a bit about how InterWorks started, which you can find in this blog post here, and shed a little light on how a group of good friends having fun started what is now a global consultancy that lives by the mantra, “Best People, Best Clients, Best Work.”

Up next came the more technical aspects of the event, where we discussed our modern technologies stack and the InterWorks reference framework, diving into how the different tools line up with each other and within the different areas required for a great data strategy to make the best use of the data.

From there, we took a broader view of the different roles within our consultancy, which tools within each role we generally use and what type of activities we do as part of such roles. We had a goal for them to understand all the different possibilities when looking at starting a professional career in the tech consultancy field.

We started to wrap up by sharing some of our clients' success stories and how data has an impact on their day to day and in the long term. We love being able to draw real-world parallels to the data they may interact with every day without realising it!

In the final leg of our presentation, we shared a couple of stories of great, and sometimes unexpected, career transformations we've seen at InterWorks. For instance, Beth Kairys went from psychologist to analytics consultant, and Colleen Chen jumped from lawyer to data scientist. We've embedded videos of their stories at the bottom of this post.

Helping Others Walk the Walk

After our presentation, we hung around with the students for some great conversations over pizza and soft drinks. We wanted to chat directly with them about the options they have when they graduate, which can be a daunting question for undergrad and grad students alike. Sharing our experiences, the technical areas we each like the most and why we love all things data gave them lots of hope for their careers to come.


Above: Part of the in-person group for the mixer

All in all, it was a fantastic event. I’m very excited for what the future holds for all these enthusiastic people! If you’re interested in learning more about DataWomen, check out our informational page here.

 

Above: Beth Kairys’s presentation

 

Above: Colleen Chen’s presentation

The post DataWomen Event Recap: IEEE Mixer appeared first on InterWorks.

]]>
DataWomen Event Recap: Everyday AI https://interworks.com/blog/2022/08/02/datawomen-event-recap-everyday-ai/ Tue, 02 Aug 2022 16:49:52 +0000 https://interworks.com/?p=47081 Information technology is one of the world’s fastest growing sectors. Data, analytics and technology have permeated through all aspects of our lives, and it is safe to say that in all sectors and industries, there are data problems to be solved. Just in Australia, the...

The post DataWomen Event Recap: Everyday AI appeared first on InterWorks.

]]>

Information technology is one of the world’s fastest-growing sectors. Data, analytics and technology have permeated all aspects of our lives, and it is safe to say that in all sectors and industries, there are data problems to be solved. Just in Australia, the IT sector is forecast to grow by 5.4% on average per year, reaching a total of 1.1 million workers by 2026. This is more than four times the expected growth rate of the workforce at large.

DataWomen is all about supporting and empowering women in data and our data friends. This aim of fostering a great community is what led us to put together this session. We started to think about topics like:

  • What in the world of data is bringing competitive advantages to our companies?
  • Why is it important that more females participate in this?
  • How does someone start progressing their career and benefiting from the untapped potential of exciting data roles out there?

Below, we revisit some of the highlights from the recent event. You can catch the full recording at the bottom of the post!

What is Advanced Analytics?

We chatted about advanced analytics (AA) being the automatic or semi-automatic examination of data using specialised tools and techniques, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions or generate recommendations. Using advanced analytics can make the difference between keeping up with the competition and falling further behind.

We also talked about machine learning being the science of getting computers to learn from experience and perform tasks automatically. Basically, it works by combining large amounts of data with fast, iterative processing and intelligent algorithms, allowing the software to learn automatically from patterns or features in the data. With today’s high-performing cloud computing technologies and the vast amount of data being collected, there has never been a better time to experiment with machine learning.

There are so many great examples of how advanced analytics is being used today. Nike, for example, uses predictive analytics to forecast consumer demand at a hyper-local level, which helps them optimise their inventory and develop more targeted campaigns. Retailers use clustering techniques to create upselling and cross-channel marketing opportunities. We even rely on advanced analytics in everyday life: Google Maps uses predictive analytics to suggest the best route to our destination, and our email providers use text mining and pattern recognition to provide spam filtering.

Gender Gap and the Motivation to Close it

“Today, women and girls are 25 per cent less likely than men to know how to leverage digital technology for basic purposes, 4 times less likely to know how to programme computers and 13 times less likely to file for technology patent. At a moment when every sector is becoming a technology sector, these gaps should make policy-makers, educators and everyday citizens ‘blush’ in alarm”

2019, UNESCO

DataWomen has always advocated for diversity. Different studies have shown that diversity increases a team’s performance by bringing in different points of view to learn from. During our webinar, we talked about two examples that show what can happen when the technology groups developing a product lack that diversity.

The first one was Apple releasing iOS 8 in 2014, which included a HealthKit app that promised to “let you see your whole health picture.” The app could track everything from sleep, BMI and weight to obscure data such as selenium and copper intake. It was, however, missing the one thing most of us women would be interested in tracking: our periods.

Our second example came from the Amazon ML specialist group that was developing a recruiting engine to score resumes from 1 to 5. In 2018, they discovered a big problem with said engine: it did not like women. The issue was that it had been trained to vet applicants by observing patterns in resumes submitted to the company over a 10-year period. Most came from men, a reflection of male dominance across the tech industry. As a result, the engine was penalising resumes that included the word “women,” as in “women’s chess club” or “Women’s College.”

These examples demonstrated, in two ways, the importance of closing the gender gap in AI/ML:

  • Awareness, education and skills. We need people to be informed of what AI is and how it works so that they can critically assess personal and societal risks and benefits from the impact of AI.
  • Increase awareness within the industry and address the lack of representation and the bias within the training data that is used.

What are We Going to do About it?

We can start by understanding that there are so many ways to participate in the information technology world! The possibilities are endless. If something interests you, explore it and see where it can take you. There are plenty of roles to choose from, like data engineer, data scientist or data analyst.

Some of the benefits of getting into the data spectrum are:

  • Almost every industry is moving towards AA and ML, which means plenty of opportunities!
  • Become eligible for a wide range of roles like data analyst, data scientist, data engineer, etc.
  • Career growth and strong pay scaling!
  • Flexibility to work remotely from anywhere around the world.
  • Plenty of amazing data science tools available in the market for upskilling.

How to get started, you ask?

Well, there are so many ways. There are plenty of free (or almost-free) courses, plus meetups, volunteering opportunities, mentorship programs and data hackathons! Check out the recording of our session with Dataiku, Everyday AI, to get started.

Be part of the DataWomen Community!

DataWomen is intentional about relationship-building and empowerment. We cover meaningful topics in our events, like salary negotiations and productivity at home, host small group chats with amazing women in our community and support a mentoring program. You can learn more about our programming at our landing page, and check out our LinkedIn group to be the first to know what’s next!

We’re always on the lookout for stellar individuals to contribute to DataWomen and help it thrive. If that sounds like you or someone you know, reach out to Azucena Coronel, Beth Kairys, Carol Prins or Debbie Yu. See you next time!

The post DataWomen Event Recap: Everyday AI appeared first on InterWorks.

]]>
DataWomen Event Recap: Step up to 2022 https://interworks.com/blog/2021/11/11/datawomen-event-recap-step-up-to-2022/ Thu, 11 Nov 2021 16:58:46 +0000 https://interworks.com/?p=43432 The end of 2021 is approaching fast, and with that comes a perfect opportunity to plan ahead and get ready to step up to 2022. DataWomen is all about supporting and empowering DataWomen and our data friends, and this aim is what led us to...

The post DataWomen Event Recap: Step up to 2022 appeared first on InterWorks.

]]>

The end of 2021 is approaching fast, and with that comes a perfect opportunity to plan ahead and get ready to step up to 2022. DataWomen is all about supporting and empowering DataWomen and our data friends, and this aim is what led us to put together this session. We started thinking about three important aspects of our lives that can have far-reaching impact: communicating with power, backing ourselves for the next role and taking control of our finances.

Below, we revisit some of the highlights from the recent event. You can catch the full recording at the bottom of the post!

Communicate with Power

We’ve long been inspired by Tara Mohr, and in the communicating with power section of the event, we highlighted content from her book, Playing Big. We discussed socialized speech patterns women have that often undermine what they say in the workplace. These can be conscious or subconscious and include habits like saying “just” and “actually,” apologizing profusely and similar behaviors.

While many women are aware of their own speech habits, they often feel forced to keep doing them because of the double bind: if they do not speak this way, they will be perceived as threatening and unlikable. Women who don’t adhere to these feminine speech norms often convey more confidence and competence, but they are also perceived as unlikable, threatening and intimidating, whereas these same speech patterns are deemed normal for men. The tradeoff for women is that they can be perceived as likable and nice OR competent in the workplace, but not both. To navigate this, we discussed ways of being mindfully warm and kind in the workplace without diminishing what you have to say. Check out the recording for more detail. We even included a Before You Hit Send checklist that helps in identifying and eliminating undermining speech in emails.

Be Your Own Best Advocate

In our second segment, Beth discussed backing yourself for your next role. We know our pitch for ourselves is crucial for answering questions like “Why should we hire you?” or “Why should I promote you or pay you more?” We also know women struggle to pitch, generally undervaluing their efforts and experience and being less likely than their male counterparts to put themselves forward for roles. In this section, Beth walked through how to pitch for three different personas: ladies lacking technical experience, business experience or general experience. We also examined how to construct and master our pitch, including how to pivot in novel contexts. Check out this article for more. It’s an oldie but still a great resource.

Manage Finances with Excellence

In our last section on finances, we identified three steps for sorting out your money matters:

  1. Get your finances in order.
  2. Catch the surplus.
  3. Make your money work harder for you.

A great plan starts by acknowledging where you are at this moment in time. This is why we began with identifying your current situation: identify all your income, which can be a paycheck but also rental income, dividends, income from a side job or from the gig economy; identify all your outgoings, including bills and utilities, lifestyle expenses and yearly provisions; identify all your debts, such as credit cards and buy-now-pay-later setups; and identify all your investments, including any retirement plan or superannuation.

Once you identify these elements, set up everything to catch all the surplus. We talked about two frameworks to help us do this: the framework of Scott Pape, author of The Barefoot Investor, and the framework of Ben Kingsley and Bryce Holdaway, authors of Make Money Simple Again and hosts of the podcast The Property Couch. For more details on these structures, check out the recording or the resources below.

Finally, we talked briefly about how to make your surplus work harder for you. A good start is to review your current retirement plan or superannuation in order to build your nest egg for the future. There are several ways of making your money do more, like starting to invest early and saving for goals that could lead you to more money-making assets. Education around knowing your options is important, and the resources below will help with that. Seeking the professional advice of a financial adviser is also highly recommended!

Explore the Resources We Shared

Resources we talked about during this event are:

We hope this session helped you get prepared for 2022 and got you thinking about how to keep growing personally and professionally! We’re certainly inspired and looking forward to what the next year will bring!

If you missed the session, check out the recording below:

Be a Part of the DataWomen Community!

DataWomen is intentional about relationship-building and empowerment. We cover meaningful topics in our events, like salary negotiations and productivity at home, host small group chats with amazing women in our community and support a mentoring program. You can learn more about our programming at our landing page, and check out our LinkedIn group to be the first to know what’s next!

We’re always on the lookout for stellar individuals to contribute to DataWomen and help it thrive. If that sounds like you or someone you know, reach out to Azucena Coronel, Beth Kairys or Debbie Yu. See you next time!

The post DataWomen Event Recap: Step up to 2022 appeared first on InterWorks.

]]>
How to Stage Email Attachment Data with Matillion https://interworks.com/blog/2021/07/21/how-to-stage-email-attachment-data-with-matillion/ Wed, 21 Jul 2021 22:09:13 +0000 https://interworks.com/?p=42473 Transferring datasets via email is definitely not ideal, and while we try to avoid it as much as we can, there are occasions where our data providers do not have the resources to provide their datasets any other way. Maybe they are able to implement...

The post How to Stage Email Attachment Data with Matillion appeared first on InterWorks.

]]>

Transferring datasets via email is definitely not ideal, and while we try to avoid it as much as we can, there are occasions where our data providers do not have the resources to provide their datasets any other way. Maybe they are able to implement better, more elegant ways of providing their data, such as APIs, file drops in cloud platforms like AWS or Azure, or database access, but setting this up takes time, and they might not be prepared to deploy it immediately. Meanwhile, the daily-refreshed data might be urgently required, making a workaround using email attachment data necessary.

Whatever the reason might be, if the need arises to stage email attachment data, Matillion comes to the rescue. With an easy recipe, we will be able to have our data staged in Snowflake, ready to be integrated with our other data sources and prepared for final consumption.

Stage the Email Details

Use the Matillion component Email Query to obtain the full details of the email and stage it in a table, which in our example is called DATA_EMAIL.

Important: There are two connection parameters you might want to use:

  • IncludeMessage = true allows you to download the message body content and attachment data.
  • MaxItems = 500 is the maximum number of emails to be staged, from most recent to oldest. The default is 100, so if you don’t modify this, it will only ever return the latest 100 emails.

Pro Tip: For performance reasons, you will want to avoid pulling the full email dataset every time you run your pipeline. In benchmark testing, an extraction with maxitems = 100 completed in under 1.5 minutes, while an extraction on the same dataset with maxitems = 1,000 took 14 minutes.

Ideally, you will stage the emails and perform incremental loads based on the email ID being greater than the max email ID in your database. For this purpose, we use a Python component to obtain the max ID and save it in a job variable (in our example, v_elt_MaxID) to be used in the Email Query.

You can use the following Python code for this purpose:

# Query the max email ID already staged so we only pull newer emails
cur = context.cursor()
qry_maxid = '''
SELECT IFNULL(MAX(ID),0)
FROM "${DestinationDatabase}"."${DestinationSchema}"."${DestinationTableName}_2"
'''
cur.execute(qry_maxid)
max_id = cur.fetchone()[0]

# Store the result in the job variable used by the Email Query component
context.updateVariable('v_elt_MaxID', max_id)
print('Variable v_elt_MaxID has been updated to {0}'.format(max_id))

Also, make sure that your Email Query component’s Load Options have Recreate Target Table set to Off.

Flatten the List of Attachments

For emails that contain multiple attachments, the list will appear in the field DATA_EMAIL.ATTACHMENTS as a semicolon-separated varchar. For example:

  • ~WRD0001.jpg; image001.jpg; CustomerData_20210101.xlsx; image (23).png; DailySales_20210101.csv

We will need individual file names later on, so the next step here is to flatten out this ATTACHMENTS field:

Below is the SQL script:

SELECT
    att.VALUE,
    ft.*
FROM "${DestinationDatabase}"."${DestinationSchema}"."${DestinationTableName}_2" ft,
    LATERAL FLATTEN(input => SPLIT("ATTACHMENTS", ';'), outer => true) att

The resulting table, DATA_EMAIL_FLATATTACHMENTS, will contain one attachment per row. This table will be used later on to loop through the file names that will be moved from the folder to a cloud storage location, such as S3.

Stage the Attachments Per Email ID

The next step is where we download the files into a folder on the Matillion box and/or store the attachment content, encoded, in a Snowflake field. In our example, the table where we are downloading the attachment data is DATA_EMAIL_ATTACHMENTS.

Again, for performance reasons, it is recommended to only download data from the emails we haven’t already processed. For this, we will use a Query to Grid component to obtain the email IDs that are not yet in DATA_EMAIL_ATTACHMENTS:

/* Query to obtain the list of email IDs to loop through and retrieve the attachments */
SELECT ID
FROM "${DestinationDatabase}"."${DestinationSchema}"."DATA_EMAIL"
WHERE ID NOT IN (
    SELECT ID
    FROM "${DestinationDatabase}"."${DestinationSchema}"."DATA_EMAIL_ATTACHMENTS"
    GROUP BY ID
)
GROUP BY 1

This ID gets mapped into a Grid Iterator, which drives the Email Query component that downloads the attachments.

It is important to note that to download attachments, it is necessary to specify a specific email ID in the query. That’s why the Grid Iterator is required to go through the IDs and save the extracted data in a table (in our example, DATA_EMAIL_ATTACHMENTS). Remember to verify that the component’s Load Options have Recreate Target Table set to Off.

The query to be used in the Email Query component is:

SELECT [Id],[AttachmentData]  FROM INBOX 
WHERE [Id] = ${v_elt_MaxID}
AND DownloadAttachments = 'true'
AND AttachmentFolder = '/data/emailattachments/'

You will need to create the folder on your Matillion box as required. In this example, it is '/data/emailattachments/'.

Pro Tip: If you don’t need to download the physical file and just need the encoded string in the Snowflake table, simply delete the clause “AND AttachmentFolder = '/data/emailattachments/'” from the email query. If you keep this clause, the AttachmentData field will show the string 'Saved to: /data/emailattachments/' instead; without it, the AttachmentData field in DATA_EMAIL_ATTACHMENTS will contain the encoded attachment. You can use the Snowflake function TRY_BASE64_DECODE_STRING(<input> [, <alphabet>]) to decode it to plain text.

This is useful when the attachment is a small, single text file (tab, comma or plain text separated) and no further processing is required; you can query it directly on the table:

SELECT
    D.ID,
    Y.VALUE::VARCHAR() AS CONTENTS
FROM "${DestinationDatabase}"."${DestinationSchema}"."DATA_EMAIL_ATTACHMENTS" D,
    LATERAL FLATTEN(input => SPLIT(TRY_BASE64_DECODE_STRING(D.ATTACHMENTDATA), '\n')) Y;
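
If you prefer to inspect a small attachment in a Matillion Python component instead of SQL, the same decoding can be done with Python’s standard base64 module. This is just a sketch under the assumption that you have first captured the encoded AttachmentData value for a single attachment into a job variable, here hypothetically named jv_attachment_b64:

import base64
import csv
import io

# jv_attachment_b64 is a hypothetical job variable holding the base64-encoded
# AttachmentData value for a single, small CSV attachment; Matillion substitutes
# its value into the script before execution.
encoded = '${jv_attachment_b64}'

# Decode the base64 string back to plain text
decoded = base64.b64decode(encoded).decode('utf-8', errors='replace')

# Parse the decoded text as CSV and print the first few rows as a sanity check
reader = csv.reader(io.StringIO(decoded))
for i, row in enumerate(reader):
    print(row)
    if i >= 4:
        break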

Move the Files to Cloud Storage

The last step is to move your downloaded attachments to a cloud storage location, such as an S3 bucket, for further processing. In this example, we only want to further process the attachments that come from a specific email address. For this, we will use a Grid to Query component in combination with a Query Iterator to loop through the names of the files that we need to move.

The Email Conditional component will have the conditions of the emails you want to further process:

SELECT ATTACHMENTS
FROM "${DestinationDatabase}"."EMAIL"."DATA_EMAIL_FLATATTACHMENTS" ft
WHERE "FROM" = 'vip_emails@important.com'

The attachment file names will be passed to the variable ${jv_curr_attachment} and then used in the Python component that moves them to S3:

import os

import boto3

# AWS Simple Storage Service client
s3 = boto3.client('s3')


def main():
    # Work from the folder where the Email Query component saved the attachments
    os.chdir('/data/emailattachments/')

    # Matillion substitutes the current attachment file name into the script
    currfile = '${jv_curr_attachment}'

    # Upload the file to the target bucket and folder
    s3.upload_file(currfile, 'mybucket', 'myfolder/' + currfile)


if __name__ == '__main__':
    main()

From here, you can further process your files as required.

Voila! There it is. Now, with Matillion and Snowflake, you can automate the dreaded email attachments. Happy pipeline building! If you have any further questions, please do not hesitate to reach out to me or the InterWorks team. We’d love to help.

The post How to Stage Email Attachment Data with Matillion appeared first on InterWorks.

]]>
DataWomen Event Recap: Work After COVID https://interworks.com/blog/2021/06/03/datawomen-event-recap-work-after-covid/ Thu, 03 Jun 2021 19:14:27 +0000 https://interworks.com/?p=41820 Earlier this month, we ran the second-ever panel for DataWomen. DataWomen is a data network that hosts regular webinars, runs a mentoring program, hosts a LinkedIn group and shares valuable content and resources to bolster the broader data community. The aim of all of this is to...

The post DataWomen Event Recap: Work After COVID appeared first on InterWorks.

]]>

Earlier this month, we ran the second-ever panel for DataWomen. DataWomen is a data network that hosts regular webinars, runs a mentoring program, hosts a LinkedIn group and shares valuable content and resources to bolster the broader data community.

The aim of all of this is to support and empower women in data – DataWomen – and data friends by creating an inclusive community for discussion and mutual learning. You can learn more about this at the DataWomen page.

The Workplace After COVID

COVID seems like it’s gone on forever, right? Many of us had our lives, personal and professional, turned upside down. So what does life and work look like during COVID and in a post-COVID world? In the most recent event, InterWorks Analytics Consultant Beth Kairys hosted Jessica Paulson from Equal Opportunity Schools and Candra McRae from Tableau for a candid conversation on life and work. The discussion between these phenomenal women was so insightful, it was hard to pick only a few highlights, but we tried anyway. Here are some of our favourite moments:

  • The importance of really turning off and leaving work at the end of the day, whether by having a completely separate workspace or through the simple act of placing your laptop in a drawer when you’re done for the day
  • The role of routine in keeping you on track and balanced, including just taking a lunch break (something we’re all guilty of missing from time to time!)
  • The need to make time to build social capital with colleagues in a remote world to ensure we’re fostering trust, a key foundation of great working relationships
  • Adjusting expectations around what is doable, or required, and outsourcing things where we can
  • An analogy for taking on and letting go of responsibility: knowing which balls you can drop – specifically, knowing which are plastic and which are glass and letting the plastic ones go

We could have talked for hours and were thrilled to see the level of engagement from the community. We’re hopeful the event gave you some food for thought! We know we left the panel inspired to put some of these tips into practice, whether it’s making time for pilates and yoga breaks or taking a walking meeting rather than Zoom.

If you missed the session, check out the recording below:

Be a Part of the DataWomen Community

Part of the DataWomen community is a mentoring program. We have deep dives on topics like salary negotiations and productivity at home, plus small group chats with amazing women in our community. You can learn more about that program at our landing page and find upcoming sessions. Be sure to join our LinkedIn group to get plugged into the DataWomen network right away and be the first to know about what’s next!

Want to get involved ASAP? We’re always on the lookout for stellar individuals to contribute to DataWomen and help it thrive. If that sounds like you or someone you know, reach out to Azucena Coronel, Beth Kairys or Debbie Yu. See you next time!

The post DataWomen Event Recap: Work After COVID appeared first on InterWorks.

]]>