Demystifying the EU AI Act

This summer, the European Union (EU) released a new regulation on the development and use of artificial intelligence (AI) in order to protect EU citizens.

The Artificial Intelligence Act (AI Act), with the official name “Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act),” can be accessed in multiple languages and formats on the official website for EU legal documents: EUR-Lex.

With 144 pages, 13 chapters, 113 articles and 13 annexes, this document can be quite overwhelming. But don't worry: this blog post breaks it down a bit and will give you a basic understanding of what it is all about.

Disclaimer: The reader is responsible for ensuring compliance with all relevant laws and regulations. InterWorks does not offer legal advice nor guarantee that its services or products will ensure compliance with any laws or regulations.

Agenda

  • When Will This be Enforced? (Timeline)
  • Who Needs to Know? (Scope and Personas)
  • What is This? (Risk Categories)
  • What Needs to be Done? (Actions)
  • Why Should You Care? (Penalties)
  • Wrap Up

When Will This be Enforced? (Timeline)

Wondering if you need to act now or can wait? This section outlines key dates and deadlines.

Proposed by the Commission in April 2021 and agreed upon by the European Parliament and the Council in December 2023, the EU AI Act entered into force on August 2, 2024.

The following dates indicate when the respective regulations will take effect based on the risk category:

  • February 2025 (after 6 months)
    Prohibitions on AI practices that pose an unacceptable risk.
  • August 2025 (after 1 year)
    Regulations for General Purpose AI (GPAI) models placed on the market from that date onwards.
  • August 2026 (after 2 years)
    Regulations for high-risk AI systems as mentioned in Annex III.
  • August 2027 (after 3 years)
    Regulations for high-risk AI systems as mentioned in Annex I, as well as necessary steps for providers of General Purpose AI models that were placed on the market before August 2, 2025.

Who Needs to Know? (Scope and Personas)

Yes, this is legal material, but it’s essential for everyone involved in AI — including users.

Scope

The AI Act applies to AI systems used in a professional environment inside the European Union (EU), including AI systems that may be located outside of the EU but whose output is used in the EU.

Exemptions for the AI Act regulations include:

  • Purely personal, non-professional activities by individuals
  • Scientific research or development
  • Military, defense or national security purposes

Personas

There are general requirements for AI systems, but most of the regulations are tailored to specific obligations depending on the parties involved, or personas. The two primary ones are:

  • Provider:
    A person that develops an AI system, or has one developed, and places it on the market or puts it into service under its own name or trademark.
  • Deployer:
    A person that uses an AI system in the course of a professional activity; depending on the system, its use may affect other persons as well.

There are also the Importer and Distributor personas, which are distinct from the Provider in that they have no involvement in the development of the AI system. They are only responsible for the later steps of the supply chain.

What is This? (Risk Categories)

How you can determine if your AI system falls under these regulations.

The AI Act divides artificial intelligence into categories based on the risk it might pose for the user or the population in general. These risk categories then determine what actions need to be taken to reduce or, better yet, eliminate any harm that could be done. There are three main risk categories you need to be aware of:

1. Unacceptable Risk ❌

The absolute No-Gos. If you do one of these, read the entire article and get busy right away.

This category describes AI practices that pose an unacceptable risk and will be prohibited entirely. This will already come into force early next year. The full list of AI practices in this category is described in chapter 2, article 5; they include:

  • Manipulate Behavior: AI systems that exploit vulnerabilities in individuals, such as those that manipulate behavior or exploit emotional responses. For example, an AI system used to spread false or biased information to influence voter decisions or incite panic or violence.
  • Social Scoring: AI systems that implement social scoring, such as those used to evaluate the trustworthiness of individuals.
  • Untargeted scraping of facial images: AI systems used to create or expand a facial recognition database by broadly scraping facial images from the internet or CCTV footage.
  • Real-Time Biometric Identification: Technologies used for real-time biometric identification in public spaces, which could infringe on privacy and civil liberties.

2. High-Risk AI 🔥

The risky, but necessary things.

High-Risk AI describes AI systems that might pose a high risk of harm to the health and safety or the fundamental rights of people. They are divided into two subcategories based on their intended use case or area. The AI Act describes this category in chapter 3, articles 6-7, and the areas are listed in annexes I and III.

Annex I covers areas in which AI systems are products, or are used as safety components in products, that are regulated under specified Union regulations. These products include:

  • Medical devices
  • Toys
  • Transportation, lifts and aviation

Annex III describes areas in which AI systems are implemented; these areas include:

  • Remote biometric identification systems (in case they are not already prohibited due to an unacceptable risk).
  • Critical Infrastructure like road traffic, or in the supply of water, gas, heating or electricity.
  • Education, employment, and asylum or border control management.

3. General Purpose AI Models with Systemic Risk

AI on a large scale, having a high impact and wide reach.

General Purpose AI models — for example, Large Language Models (LLMs) — will be classified as having a systemic risk when they have high impact capabilities on the EU market due to their reach, or have a potential negative effect on public health, safety, public security, fundamental rights or the society as a whole.

The impact capabilities will be assessed by the Commission based on specific factors, such as:

  • The model’s design.
  • The quality or size of its training data.
  • The computational power required for training.
    (For the techies: when the cumulative amount of computation used for its training exceeds 10^25 floating point operations.)

More details on this category can be found in chapter 5, article 51 and annex XIII.

What Needs to be Done? (Actions)

Now that you might have more clarity on how to categorize your AI system, you need to know what actions you need to take. There are numerous requirements, too many to cover fully here. I will highlight some of the most important ones, to give you a sense of the overall direction.

General Requirements for High-Risk AI Systems

The general requirements are listed in chapter 3, section 2; the following gives you a summary of what they entail:

  • Risk management system
    The Act mandates a comprehensive risk management process to continuously identify, assess, mitigate and monitor risks throughout the system’s lifecycle.
  • Data and data governance
    The data used for developing the AI model shall be of high quality, relevant and representative to minimize bias and ensure accuracy, with clear documentation on data sources and processing methods.
  • Technical documentation
    Technical documentation shall be drawn up to ensure transparency and enable authorities to assess compliance with safety, accuracy and ethical standards throughout the AI system's lifecycle.
  • Record-keeping
    The AI system shall automatically keep records and logs to ensure traceability, allowing authorities to monitor and verify its compliance with safety and transparency standards.
  • Transparency and provision of information to deployers
    The AI system shall be accompanied by clear and comprehensive information regarding its intended purpose, capabilities and limitations, along with any necessary instructions for safe and responsible use.
  • Human oversight
    The AI system shall be designed to allow effective human oversight, enabling operators to monitor and intervene in the system’s operations to prevent or minimize risks to health, safety and fundamental rights.
  • Accuracy, robustness and cybersecurity
    The AI systems shall be designed and developed to ensure high accuracy and reliability while implementing measures to protect against cybersecurity risks throughout their lifecycle.

Obligations Based on Personas

In addition to these general requirements for high-risk AI systems, there are special obligations depending on the personas involved.

  • Providers are, for instance, required to:
    • Implement a quality management system to ensure compliance with the regulation.
    • Monitor system performance after deployment and address any issues that arise to ensure compliance and safety.
  • Deployers must also ensure the following:
    • High-risk AI systems are used in accordance with instructions and information made available by the provider.
    • Effective human oversight is in place, conducted by a trained individual with adequate support.
    • Accurate records of the system's use are maintained.
    • Ongoing monitoring of the system's performance, reporting of any issues to the provider, and compliance with applicable safety and ethical standards.
  • Importers and Distributors shall, for example, ensure that:
    • The provider has drawn up the required technical documentation.
    • The product bears the required CE mark and is accompanied by the EU declaration of conformity (these two are required for any system operating within the European Economic Area (EEA), declaring that the product meets all the legal requirements).

More on these persona differentiations can be found in chapter 3, section 3 and even more regulations regarding things like notifications, certifications and CE markings are listed in chapter 3, sections 4 and 5.

Regulations for GPAI models with a Systemic Risk

Additional regulations for providers of GPAI models posing a systemic risk, as outlined in chapter 5, article 55, include:

  • Model Evaluation and Testing: Conducting standardized, state-of-the-art evaluations — including adversarial testing — to identify and mitigate systemic risks.
  • Risk Assessment at Union Level: Identifying and mitigating risks that may arise across the EU from the model’s development, deployment or use.
  • Incident Tracking and Reporting: Documenting and reporting serious incidents and corrective actions promptly to the AI Office and relevant authorities.
  • Cybersecurity Measures: Ensuring robust cybersecurity for both the model and its physical infrastructure.

Specific Transparency Obligations

In addition to the risk categories, there are transparency obligations for AI systems that directly interact with a person and/or generate audio, image, video or text output. These shall make it clear to the user that the content is artificially generated. More details on this can be found in chapter 4, article 50.

Why Should You Care? (Penalties)

Why should we care? Could we just ignore this?

Nope, because like most laws and regulations, non-compliance comes with penalties. And given the sensitivity of this topic and the intent to protect EU citizens, these penalties are substantial, so I strongly recommend ensuring compliance.

I am focusing on the main categories here. The full text regarding all fines and penalties can be found in chapter 12 of the AI Act.

  1. Non-compliance with the prohibited AI practices (article 5)
    • Up to EUR 35 million.
    • Or 7% of worldwide annual turnover, whichever is higher.
  2. Non-compliance with most other regulations (e.g. for high-risk AI systems, GPAI models, or transparency obligations)
    • Up to EUR 15 million.
    • Or 3% of worldwide annual turnover, whichever is higher.
  3. The supply of incorrect, incomplete, or misleading information to notified bodies or national competent authorities in reply to a request
    • Up to EUR 7.5 million.
    • Or 1% of worldwide annual turnover, whichever is higher.

Beyond these penalties, it is crucial to be aware of the status of your AI system. For instance, cloud platforms like the Snowflake AI Data Cloud already restrict the use of their AI functionality for systems classified as prohibited or high-risk, as outlined in their Acceptable Use Policy in Chapter II D.

Wrap Up

This article offers a general overview of the AI Act, which may evolve over time. I've highlighted some key sections to help you understand what to focus on and to navigate the official document more easily. This article is by no means a replacement for legal counsel.

I highly recommend familiarizing yourself with the official regulations and identifying any areas that might apply to you. If your product or work could potentially fall into one of the restricted categories, it’s wise to take action now to avoid fines or other penalties — and to ensure a safe environment that protects the people of the European Union.

Tag To Protect – Keep Your Data Safe with Object Tagging in Snowflake

As data breaches and cyber threats continue to rise, securing your data has become a top priority for businesses and organisations across all industries. Whether you want to safeguard sensitive customer data, protect your intellectual property or need to adhere to regulatory requirements, protecting your data is critical. This is where Snowflake's Object Tagging comes into play. Snowflake's Object Tagging capabilities allow you to strengthen your data security by classifying and managing data on a granular level, ensuring that only the right people have access to the right data.

What is Object Tagging?

With Object Tagging, Snowflake provides an easy way to classify objects by attaching a tag with a string value to them. Tags are schema-level objects that can be attached to other Snowflake objects like columns, users, databases or even the entire account.

This attached tag can label the object to help determine its security or protection level. Most importantly, tags enable organisations to control data access. You can, for example, label sensitive data with tags like “PII” (Personally Identifiable Information) or “Confidential” and then apply security policies that ensure only authorized users can access this information. This way, securing your data becomes easily manageable.

What Should You Know About Tags?

Tags are key-value pairs, where the value can vary for different objects. A tag named “security-level,” for example, can have possible values like “confidential,” “internal” or “public.” The maximum number of these values is 300 per tag. You can also restrict the values by defining a list of allowed values to control the usage.

Tags follow the Snowflake securable object hierarchy and are inherited, which means that a tag on a table will also be attached to all of its columns. This is referred to as tag lineage. Within this hierarchy, the inherited tag can be overwritten with a more specific value on an object lower in the hierarchy. A tag with the same name set on a column will overwrite the one that was attached to the table itself.

NOTE: This does not include propagation to nested objects, which have to be recreated.
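As a minimal sketch of this override behaviour (table, column and tag names are hypothetical, echoing the “security-level” example above):

-- tag the whole table; every column inherits the value via tag lineage
ALTER TABLE customers
  SET TAG security_level = 'internal';

-- override the inherited value on one particularly sensitive column
ALTER TABLE customers
  MODIFY COLUMN email
  SET TAG security_level = 'confidential';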

Can you have unlimited tags? Not quite, but the quotas are quite high; see for yourself:

  • The number of tags within an account is limited to 10,000.
  • Usually, an object can have up to 50 unique tags.
    • Inside a table or view, this varies a bit: for columns, the maximum of 50 applies across all columns of a table or view combined.
    • Be aware that this limit includes dropped tags (within 24 hours after dropping the tag), so best practice is to UNSET a tag from all objects it is attached to before dropping it. The reason for this is that dropped tags can be undropped within 24 hours and will then be restored with all the references they had at the time they were dropped. Please refer to the Snowflake documentation on managing tag quotas for more information.

Speaking of dropping a tag: A tag can indeed be dropped even if it is assigned to an object. So before dropping it, determine all of the objects the tag is assigned to by calling the Account Usage table function TAG_REFERENCES_WITH_LINEAGE (more info below). An exception here are tags with a masking policy assigned: these cannot be dropped while attached to an object; they have to be unset first.
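A short sketch of that clean-up sequence (database, table and tag names are hypothetical):

-- find every object the tag is attached to
SELECT * FROM TABLE(SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES_WITH_LINEAGE('my_db.my_schema.pii_tag'));

-- unset the tag from those objects, then drop it
ALTER TABLE customers MODIFY COLUMN email UNSET TAG my_db.my_schema.pii_tag;
DROP TAG my_db.my_schema.pii_tag;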

You cannot define future grants on tags.

Be aware that when you create a view or a secure view from a table that has tags attached, the tags will not be present in the view. However, if you clone an object, all tag associations of the source object (e.g. a table) are maintained in the cloned object.
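A minimal sketch of that difference (table and view names are hypothetical):

-- a view built on a tagged table does not carry the table's tags
CREATE OR REPLACE VIEW customers_view AS
  SELECT * FROM customers;

-- a clone keeps all tag associations of the source table
CREATE OR REPLACE TABLE customers_clone CLONE customers;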

Create and Apply a Tag

First, you have to make sure you have the right privileges (a sketch of the corresponding grants follows this list).

  • To create a tag, you need the CREATE TAG privilege and at least USAGE on the database and schema.
  • To apply a tag to an object, you need either:
    • OWNERSHIP on the object and APPLY on the tag,
    • Or your role has to be granted the account-level (global) APPLY TAG privilege.
  • If you want to alter a tag, like attaching a masking policy, your role has to have either:
    • The global APPLY MASKING POLICY privilege and the APPLY privilege on the tag,
    • Or the global APPLY MASKING POLICY and the global APPLY TAG privilege.
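The corresponding grants might look like this (role, database and schema names are placeholders):

-- let a dedicated role create tags in a governance schema
GRANT USAGE ON DATABASE governance_db TO ROLE tag_admin;
GRANT USAGE ON SCHEMA governance_db.tags TO ROLE tag_admin;
GRANT CREATE TAG ON SCHEMA governance_db.tags TO ROLE tag_admin;

-- allow the same role to apply any tag in the account
GRANT APPLY TAG ON ACCOUNT TO ROLE tag_admin;

With the privileges in place, you can create a tag and apply it to a column: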
-- create a tag
--------------------------
CREATE OR REPLACE TAG PII_TAG
  ALLOWED_VALUES 'confidential', 'internal', 'public'
  COMMENT = 'PII security level for column masking';

-- apply the tag to a column in a table
--------------------------
ALTER TABLE <table_name>
  MODIFY COLUMN <column_name>
  SET TAG PII_TAG = 'internal';

Tags for Security

Before we move on to the details about the power couple Tags & Policies, I want to point out some general benefits of using object tagging to improve the security requirements in your organization:

  1. Granular Access Control
    • Object tagging allows you to implement highly specific policies based on the applied tags. By tagging sensitive columns, you can prevent unauthorized access and increase the security standard on a granular level.
  2. Dynamic Security Policies
    • Things change, and so might the security level of your data. Using Tags, these changes are easily implemented by simply updating the tag value or attaching a new policy to an existing tag.
  3. Compliance with Data Privacy Regulations
    • Tags help you ensure that PII data is handled in accordance with data privacy regulations like the GDPR (General Data Protection Regulation) in the EU.
  4. Audit and Monitoring Capabilities
    • Tags simplify identifying sensitive data and allow you to track usage of vulnerable information based on applied tags. Please read the following blog post of one of my colleagues for more on that topic: Microchip Your Data: Snowflake Column Lineage.

How Best to Combine Tags and Policies?

Best practice is to attach a policy to a tag which can then be attached to an object. This provides the following benefits:

  • You can apply a policy to multiple objects by just adding it once to a tag.
  • You can attach multiple policies to one tag.
  • You can easily switch the policy on a tag, and all tagged objects will be protected by the new policy.
  • You can apply multiple variations of one policy, based on different data types.
  • You can apply the tag to the table and the policy automatically protects all of its columns, including new columns.

There are two general ways of utilizing tags in combination with policies:

  1. You can use a more generic masking policy that simply uses the tag to identify securable objects.
  2. Or you can achieve finer-grained control by using the tag value to produce different outcomes.

The following code snippet shows the first approach: it creates a masking policy and attaches it to a tag, which can then be applied to various objects. It even takes it one step further and includes role-based access control: If the user is accessing the data with a privileged role, they will see the clear value; for all other roles, the content will be masked.

-- create a Masking Policy
--------------------------
CREATE OR REPLACE MASKING POLICY <policy_name>
   AS (val string) RETURNS string ->
      CASE
         WHEN CURRENT_ROLE() IN ('PRIVILEGED_ROLE') THEN val
         ELSE '*** MASKED VIA TAG ***'
      END;

-- attach the Masking Policy to the tag
--------------------------
ALTER TAG HIDE_FROM_READER_TAG 
   SET MASKING POLICY <policy_name>;

If you want to include different outcomes based on the value of the tag, you can include this in the policy by adding the SYSTEM$GET_TAG_ON_CURRENT_COLUMN function as the following script shows:

-- create a Masking Policy based on the Tag value 
--------------------------
CREATE OR REPLACE MASKING POLICY <policy_name> AS (val string) RETURNS string ->
   CASE
      -- confidential columns will be masked
      WHEN SYSTEM$GET_TAG_ON_CURRENT_COLUMN('PII_TAG') = 'confidential' 
        THEN '*** confidential ***'
 
      -- public columns will get the unmasked value
      WHEN SYSTEM$GET_TAG_ON_CURRENT_COLUMN('PII_TAG') = 'public'   
        THEN val

      -- default to the masked value
      ELSE '******'
   END;

Some additional things to note when using tags with policies:

  • Be aware that you can have only one policy per tag per data type, i.e. you can create one policy per data type and then attach each of them to the tag to produce a different outcome based on the data type (see the sketch after this list). This is useful when applying the tag to a table, where each column then inherits the tag regardless of its data type.
  • A policy directly attached to an object will overwrite a policy that is attached via a tag.
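A sketch of the one-policy-per-data-type idea (policy names are hypothetical; PII_TAG is the tag created earlier):

-- one masking policy per data type
CREATE OR REPLACE MASKING POLICY mask_pii_string AS (val string) RETURNS string ->
   CASE WHEN CURRENT_ROLE() IN ('PRIVILEGED_ROLE') THEN val ELSE '***' END;

CREATE OR REPLACE MASKING POLICY mask_pii_number AS (val number) RETURNS number ->
   CASE WHEN CURRENT_ROLE() IN ('PRIVILEGED_ROLE') THEN val ELSE NULL END;

-- both can be set on the same tag; Snowflake applies the one matching the column's data type
ALTER TAG PII_TAG SET MASKING POLICY mask_pii_string;
ALTER TAG PII_TAG SET MASKING POLICY mask_pii_number;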

If you want to find out more about masking your data, please also have a look at this blog post: Snowflake Masking Policies Demystified.

Monitor Tags

There are several ways to discover and monitor tags. You can of course monitor tags via Snowsight or, as I prefer, using SQL. The most commonly used commands are listed below. As always, the outcome depends on the role executing the command; if your role does not have sufficiently high privileges, you might not get the expected results. Note that the SNOWFLAKE.ACCOUNT_USAGE schema might have a delay of up to 2 hours for items to be listed. More details can be found in the Snowflake documentation on monitoring tags.

SHOW TAGS and TAGS

The SHOW TAGS command lists all tags that exist within a database or schema. The second option, the ACCOUNT_USAGE.TAGS view, shows results on the account level:

-- show all tags within a database or schema
SHOW TAGS;
SHOW TAGS IN SCHEMA <database_name>.<schema_name>;


-- show all tags within your Snowflake account, including deleted tags
USE ROLE ACCOUNTADMIN;
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAGS ORDER BY TAG_NAME;

SYSTEM$GET_TAG

This system function will return the tag value associated with the specified object. You will need to provide the names of the tag and the object as well as the kind of object we're looking at. This is referred to as the object domain, which can be anything from account to column. The code below uses a table object.

Note that the two related functions SYSTEM$GET_TAG_ON_CURRENT_TABLE and SYSTEM$GET_TAG_ON_CURRENT_COLUMN, the latter of which was used in the masking policy above, are only available within a policy declaration and cannot be called in a SELECT query like the SYSTEM$GET_TAG function below.

-- syntax
SYSTEM$GET_TAG( '<tag_name>' , '<obj_name>' , '<obj_domain>' );

-- select the value for tag pii_tag attached to the my_users_table
SELECT SYSTEM$GET_TAG('pii_tag', 'my_users_table', 'table');

TAG_REFERENCES

TAG_REFERENCES is available as a function in the INFORMATION_SCHEMA of each database and as a view in the ACCOUNT_USAGE schema of the SNOWFLAKE database.

The FUNCTION is bound to the database level and returns a list of tags associated with the specified object, including tag lineage. The example uses the table object “my_users_table.” Possible object domains for this function include database, stage, table, column and many more.

The VIEW returns a list of all tags in the account and does not include tag lineage:

-- execute the function at database level
SELECT * 
  FROM TABLE(<database_name>.INFORMATION_SCHEMA.TAG_REFERENCES('my_users_table', 'table'));

-- query the view at account level
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
  ORDER BY TAG_NAME, DOMAIN, OBJECT_ID;

TAG_REFERENCES_ALL_COLUMNS

This function is similar to TAG_REFERENCES listed above, but here, the object domain is restricted to table. It returns every tag set on every column in a given table or view, whether the tag is directly assigned to a column or through tag lineage:

-- syntax
TAG_REFERENCES_ALL_COLUMNS('<object_name>', '<object_domain>')

-- list all tags applied to a specific table
SELECT * FROM TABLE(INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS('my_db.my_schema.my_table', 'table'));

TAG_REFERENCES_WITH_LINEAGE

This function is different from the ones above as it is applied to a tag and returns a list of all associations between this tag and the Snowflake objects to which it is attached. As the name suggests, this function also includes tag lineage:

-- syntax
TAG_REFERENCES_WITH_LINEAGE('<name>')

-- list all occurrences of a tag
SELECT * FROM TABLE(SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES_WITH_LINEAGE('my_db.my_schema.my_tag'));

POLICY_REFERENCES

Since tags are frequently combined with masking policies, you may also want to check which policies are in play. The POLICY_REFERENCES table function in a database's INFORMATION_SCHEMA (and the corresponding ACCOUNT_USAGE view) returns the policies attached to a given object, or, when called with a policy name, the objects that policy is attached to.
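A short sketch of both variants (database, schema and object names are placeholders):

-- list the policies attached to a specific table
SELECT *
  FROM TABLE(my_db.INFORMATION_SCHEMA.POLICY_REFERENCES(REF_ENTITY_NAME => 'my_db.my_schema.my_table', REF_ENTITY_DOMAIN => 'table'));

-- list the objects a specific policy is attached to
SELECT *
  FROM TABLE(my_db.INFORMATION_SCHEMA.POLICY_REFERENCES(POLICY_NAME => 'my_db.my_schema.my_masking_policy'));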

Monitor Tags Outside of Snowflake

To go one step further, you can even monitor your tags outside of Snowflake. Most data cataloguing tools provide functionality to monitor and import Snowflake metadata. This combination makes your data governance even more powerful, as you can, for example, centralize governance tasks more easily and provide direct access for qualified users to submit requests for tags or even handle masking themselves.

Two great examples of this combination are Atlan and Informatica; both provide native Snowflake integrations that enable you to directly import your Snowflake tags into the data catalog. It would be beyond the scope of this post to go much deeper on this topic, but I wanted to at least mention this great option. If you want to read more about governance and data catalogues, please feel free to have a look at these two blog posts: What Is Data Governance? and De-Mystifying the Data Catalog: Find and Use Your Data.

Wrap Up

I hope that with this blog post you have enough to get started and will be able to use Object Tagging to secure your data in no time. If you want to go on and dive deeper into this topic, please refer to the official Snowflake documentation on object tagging.

Snowflake Row Access Policies Demystified

Row Access Policies are a convenient way to apply row-level security to multiple tables or views in one go. You simply create one policy and can apply that to as many objects as you like. Best of all, if you need to make adjustments, you only need to make them in one place.

Row-Level Security

But first, a quick introduction to Row-level Security (RLS). In RLS, you grant access to specific records according to previously defined restrictions. These restrictions either depend on a Context Function or a Condition Expression Function.

  • Context Functions are provided by Snowflake and give access to the context in which the statement is executed. A popular function to use for RLS is IS_ROLE_IN_SESSION(). It checks whether a given role is active in the current session and is used to restrict access based on the role with which the user is querying the table.
  • Conditional Expression Functions like CASE() are based on logical operations and are usually combined with a mapping table which provides details about the access restrictions.

The result of these functions is a BOOLEAN value, where TRUE simply means “show this row” and FALSE “do NOT show this row.” One common use case is restricting access based on regional data, which I will show in my example below.

Privileges and How to Manage Access Policies

There are three privileges to be aware of when working with row access policies:

  • CREATE, which allows a role to create a new row access policy.
  • APPLY, which enables a role to add policies to and drop policies from a table or view.
  • OWNERSHIP. Only one role can hold this privilege. It grants full control over the policy and is required to alter most properties of a row access policy.

A combination of privileges might be necessary to add policies. The APPLY privilege, for example, can only be used by a role which also holds OWNERSHIP of the database object to be secured, unless it is granted on the account level. For more detailed information, please refer to the Snowflake documentation on row access policy DDL.

To manage access policies, there are several governance approaches with which the responsibilities can be divided between teams to support segregation of duties. The following table outlines the change in responsibilities for a centralized, hybrid or decentralized governance architecture.

                                    Centralized        Hybrid             Decentralized
  Create policies                   Governance team    Governance team    Developer team
  Apply policies to tables/views    Governance team    Developer team     Developer team

How to Implement a Basic Row Access Policy

Now I will give you an example of how to create and apply a row access policy. This policy is based on a combination of the above-mentioned context and conditional expression functions and depends on the role and a mapping table. As mentioned above, this policy will manage access based on regional information.

Step 1: Grant Privileges

First, we need to make sure the necessary privileges are in place. For a centralized approach, all is handled by the governance team. For the hybrid approach, this will be divided between the governance and the developer team. The code examples will show both approaches:

-- a governance role will have the ability to create row access policies
USE ROLE securityadmin;
GRANT CREATE ROW ACCESS POLICY ON SCHEMA <schema_name> TO ROLE <governance_role>;
-- for a centralized approach:
-- a governance role will also have the ability to apply a policy
GRANT APPLY ROW ACCESS POLICY ON ACCOUNT TO ROLE <governance_role>;

Step 2: Create a Mapping Table

We then need a mapping table to define which role can see which data records. For this example, we will provide data for three regional managers who will only be able to see records from their respective regions (a sketch of the inserted rows follows below).

-- create a mapping table
CREATE TABLE sales_manager_regions (
  sales_manager varchar,
  sales_region varchar
);

Create a Mapping Table Snowflake
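The rows in the mapping table might look like this (role and region names are illustrative):

-- populate the mapping table with one row per regional manager
INSERT INTO sales_manager_regions (sales_manager, sales_region)
  VALUES ('US_MANAGER', 'US'),
         ('UK_MANAGER', 'UK'),
         ('DE_MANAGER', 'DE');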

Step 3: Create the Row Access Policy

Now to the main part: creating the policy. This is done by the governance team. Access is granted based on sales manager regions by referencing a mapping table and the currently used role. In other words, if the sales manager for region US is selecting from the table, they only get the rows from region US and cannot see any other regions.

After the AS keyword, we provide the signature and return value of the policy. The signature contains one or more attributes that must be considered to determine whether the row is accessible or not. These are the columns to which the policy will be attached in the next step. Since we can attach the policy to multiple tables or views, the name of the attribute can differ from the column name, but the data type has to match. The return value will always be BOOLEAN, as this is the above-mentioned equivalent of TRUE (“show this row”) and FALSE (“do NOT show this row”).

The CASE statement provides the functionality of the policy. Its first branch gives full access to one specific role. Here, the global manager can see all records. The second branch matches the records from the mapping table against the current role and the value of the region attribute from the signature.

Finally, the ELSE branch adds the default return value of FALSE, which is considered a best practice to catch cases where we have no match.

-- create a simple row-level policy using a mapping table
USE ROLE <governance_role>;
CREATE OR REPLACE ROW ACCESS POLICY <rls_policy_name> 
AS (region varchar) RETURNS BOOLEAN -> 
CASE

  -- provide full access to the global manager
  WHEN ( IS_ROLE_IN_SESSION('<global_manager_role>'))
  THEN TRUE
  
  -- provide access based on the mapping of role and region
  WHEN EXISTS (
    SELECT 1 FROM sales_manager_regions
      WHERE IS_ROLE_IN_SESSION(sales_manager)
      AND region LIKE sales_region
    )
  THEN TRUE
  
  -- always default deny
  ELSE FALSE

END;

Step 4: Apply the Row Access Policy

The last step will be to apply the policy to a table. For the centralized governance approach, this is also handled by the governance team. With the hybrid approach, the developer team, which is also the owner of the table, will have the privilege to apply the policy. Remember that the necessary permission for this command is either APPLY ROW ACCESS POLICY on the account level or a combination of APPLY on the specific policy and OWNERSHIP of the protected object.

-- for the hybrid governance approach:
-- a developer role will have the ability to apply the policy
GRANT APPLY ON ROW ACCESS POLICY <rls_policy_name> TO ROLE <developer_role>;
-- apply the policy to a table or a view
USE ROLE <role_with_enough_privilege>;
ALTER TABLE <table_name>
ADD ROW ACCESS POLICY <rls_policy_name> ON (<column_name>);

Step 5: Check the Result

The global manager now has full access to all regions:

Check results

The UK manager, for example, can only see records from the UK region:

UK region only

How to Check for Existing Policies

Now, how do we know which policies exist or to which object they are attached?

To check which policies exist in the schema, you can simply list them with SHOW ROW ACCESS POLICIES;. You can also add a scope to adjust the search via SHOW ROW ACCESS POLICIES IN DATABASE; or even SHOW ROW ACCESS POLICIES IN ACCOUNT;. But be aware that you will only see policies you have access to, i.e. you hold either the OWNERSHIP or APPLY privilege on them.

For detailed information about one policy, you can use DESCRIBE ROW ACCESS POLICY <policy_name>; to see the signature and the functionality used. Or query the policy reference table function in the information schema with SELECT * FROM TABLE(<db_name>.INFORMATION_SCHEMA.POLICY_REFERENCES(POLICY_NAME => 'my_policy')); to see which objects the policy is attached to. You can also see this in the table or view definition.

To check on a higher level, you can use the Snowflake database with SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.ROW_ACCESS_POLICIES; to list all row access policies in the account and SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.POLICY_REFERENCES; to see which objects they are applied to. For the second query, new entries might take up to two hours to become visible. The commands from this section are collected in the sketch below.
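Collected as a runnable sketch (replace the policy and database names with your own):

-- list policies in the current schema, database or account
SHOW ROW ACCESS POLICIES;
SHOW ROW ACCESS POLICIES IN DATABASE;
SHOW ROW ACCESS POLICIES IN ACCOUNT;

-- inspect one policy and find out where it is attached
DESCRIBE ROW ACCESS POLICY <rls_policy_name>;
SELECT * FROM TABLE(<db_name>.INFORMATION_SCHEMA.POLICY_REFERENCES(POLICY_NAME => '<rls_policy_name>'));

-- account-wide overview (ACCOUNT_USAGE may lag by up to two hours)
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.ROW_ACCESS_POLICIES;
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.POLICY_REFERENCES;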

One useful command in this context is the EXECUTE USING query. If you are using CURRENT_ROLE() in the policy, you can emulate a specific role to test if the policy is working as it should.

EXECUTE USING POLICY_CONTEXT(CURRENT_ROLE => '<test_role>') AS SELECT * FROM <table_with_RLS>;

Things to Consider When Working with RLS

  • RLS can only be assigned to tables and views.
  • At runtime, Snowflake creates a dynamic secure view of the database object, applies the RLS policy and shows the rows that evaluate to TRUE.
  • RLS is evaluated before column-level security (column masking).
  • A given table or view column can be specified in either a row access policy signature or a masking policy signature. In other words, the same column cannot be specified in both a row access policy signature and a masking policy signature at the same time.

To speed up the performance when using row-level policies or RLS in general, you should be aware of the following points:

  • Reference as few columns as possible. This improves runtime, as all referenced columns need to be scanned.
  • Try to use simple SQL queries. A simple CASE statement may be better than a mapping or lookup table (see the sketch after this list).
  • Subqueries in a policy may cause errors.
  • Clustering by filter attributes can speed up the performance.
  • Statistics like COUNT() or MAX() will be slower on tables with RLS in place.
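As a sketch of such a simple policy that avoids the lookup table entirely (role and policy names are hypothetical):

CREATE OR REPLACE ROW ACCESS POLICY simple_region_policy
AS (region varchar) RETURNS BOOLEAN ->
CASE
  WHEN IS_ROLE_IN_SESSION('GLOBAL_MANAGER') THEN TRUE
  WHEN IS_ROLE_IN_SESSION('US_MANAGER') AND region = 'US' THEN TRUE
  WHEN IS_ROLE_IN_SESSION('UK_MANAGER') AND region = 'UK' THEN TRUE
  ELSE FALSE
END;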

I hope this blog post will help you implement RLS using row access policies. For a more detailed look, please refer to the official Snowflake documentation on row access policies.

If you are also interested in column-level security, please check out the recent blog post by my colleague Austin Schwinn. He describes masking policies, which use a similar approach: Snowflake Masking Policies Demystified.

If you want to learn more or need help with Snowflake or your data pipelines, please reach out to us at InterWorks.

Demystifying Variables in Matillion ETL

What Are Variables?

If you have ever done any programming, and since you are reading this blog post you most likely have, you will be familiar with the concept of variables. They are these little helpers in which you can store almost anything, except physical objects (you cannot store your favorite book, but you can store its name). You can give a variable a nice (meaningful) name and retrieve it wherever and whenever you need it.

Variables are a very neat concept and without them programming would be pretty difficult to implement and extremely inflexible.

Can We Use This in Matillion ETL (METL)?

Of course we can. Matillion has done a great job in providing different variable types for specific use cases, making it super easy to implement. OK, you have to get familiar with these types and how to use them, but it is pretty straightforward. This blog post will provide you with a general overview and hopefully enough information to get started.

Different Variable Types in METL

Environment Variables

Environment variables are of a global scope as they are accessible throughout your entire METL client. You can store values that can be used by multiple jobs, like the name of your database and schemas, specific connection strings or project names.

Job Variables

Job Variables are defined within individual jobs. They hold a scalar value and will mostly be used in a local, job-wide scope. However, they can be handed over to other jobs, which will be explained a bit later.

Grid Variables

Grid Variables are special in that they can hold a whole set of data, like a table. They are also only available within the scope of a specific job, but they can be provided for other jobs like the job variables.

Automatic Variables

Automatic variables are, as the name suggests, automatically available after setting up your METL project. They hold information about your environment, project or job, like the name and ID, for example. Matillion provides a full list of the available automatic variables and how to use them.

How to Create Variables in METL

Environment Variables are available for the entire project. You can create them in the Manage Environment Variables dialog from the Project Menu:

Matillion: Manage Environment Variables

Job and Grid Variables are defined for each job, so they will be created in the Manage Job Variables or Manage Grid Variables dialogs. You can open these by right clicking on the job in the explorer menu or by right-clicking the canvas inside the open job.

Matillion: Job and Grid Variables

There are some settings that are common to all variables:

  • Name: Give your variable a meaningful name.
  • Type: This refers to the data type in which the variable is stored.
    You can choose between Text, Numeric, DateTime and Data Structure.
  • Behaviour: This refers to the way a variable is updated when used in multiple branches; for example, in a control structure or an iterator. You can choose between copied or shared. A copied variable will always start with the default value and keep any updates within the scope of the current branch. Each branch will get its own copy of the variable. A shared variable, on the other hand, will use any updates made throughout the entire job. It is shared over all branches.
  • Value: Here you can set the default value for the variable.
  • Description: An optional description; this can be very valuable for documentation.

Now to the specific flavours of each variable type. The Manage Environment Variables dialog will have additional options depending on your environment setting. Each of your environments will have its own column in which you can specify the value that is used according to the currently selected environment. For example, if you have two environments named dev and test, you will have two additional columns also named dev and test.

The dialogs for Job and Grid Variables also provide an additional option:

  • Visibility: This refers to the scope of the variable. You can choose between Private and Public. When set to Private, the variable is limited to the current job, meaning it cannot be seen in other jobs when using the Run Orchestration or Run Transformation components. Using the Public option, you can hand variables over between jobs, which will be explained later in this post.

Matillion: Manage Job Variables (Visibility)

For Grid Variables you have two more sections:

  • Columns: Here you can define the Name and the Type for each column.
  • Values: In this section you can provide default values for the created column.

In the animation below we create a Grid Variable named “test_grid” with two columns for “name” and “value”. These columns are populated with default values:

Matillion: "test_grid" Variable

You also might have noticed the little check box labeled Text Mode. Text Mode is a pretty useful feature in which you can switch to a CSV style, tab-delimited view. In this view, instead of adding one variable at a time, you can copy and paste multiple variables in one go.

How to Update Variables in METL

Once a variable is created, it can be updated in many ways, like using the Iterator components. There are different types of iterators available, like the Fixed Iterator or the Loop Iterator. You will provide the variable and it can be updated with every iteration.

In the Python component, you can use context.updateVariable('variable name', 'new value') to update the value of the job variable. For Grid variables, this can be achieved using context.updateGridVariable('<GridName>', <values>) and providing an array as the value. A variable can be retrieved simply by using ${variable name}.

Variables can also be updated by directly storing query results during runtime. For job variables, this could be done with the Query Result To Scalar component, which should return a scalar value. Grid variables can be updated with the Query Result To Grid component, which is equivalent to the previous case but returns an array. Another option is to query and store metadata using the Table Metadata To Grid component.

There are some great blog posts from my colleagues, here at InterWorks, that go into more detail about using some of the mentioned components. There are two posts by Chris Hastie:

  1. The first one is about How to Reorder the Values of a Matillion Grid Variable, which is done with a Python component and can be useful after loading values with the Query Result To Grid component.
  2. The second one is about How to Combine Values of a Matillion Grid Variable into a Scalar Variable, where he shows how to use contents of a Grid Variable inside an SQL component.

And please check out Load Multiple Tables into Snowflake with Two Matillion Tools by Scott Perry (Spoiler: He is using an Iterator component).

How to Pass Variables to Other Jobs

Using variables inside of one job is great, but using them in an entire pipeline and passing them from one job to another (and back) is even better. Remember the Private and Public Visibility that was mentioned earlier? Now this will be important. When calling other METL jobs with the Run Orchestration or Run Transformation components, you can define which variables should be provided for the called job. This requires that the variable is declared as Public and that it is created in both jobs.

But from the beginning: In the Run Orchestration component properties, you have the options Set Scalar Variables and Set Grid Variables. In the respective dialog that opens, you can select any variable that is declared in the called job. For the value that is passed over from the main job, you can either provide a plain fixed value or, for the purpose of passing a variable, provide a reference to a variable from the current calling job. To keep track, it is recommended to use the same name for the variable that is passed along:

Matillion: Run Orchestration

In order to retrieve an updated variable back from a called job, it is important that you declare it in the Export tab in the Properties panel.

This way you can also provide the variable(s) through an entire pipeline, i.e. over multiple jobs, possibly making changes along the way.

Some Special Use Cases for METL Variables

One great way to speed up and simplify your data transformation process is to use grid variables in your various transformation steps. This has the advantage that you can concentrate all necessary updates and changes in one main job in your pipeline. For example, if a new field needs to be added from your data source, instead of having to document all jobs and components which need to be updated, you can simply define that grid variables x, y and z need to be adjusted in the main job.

You can then pass them over to the jobs that do the magic and don’t have to touch these other jobs in the pipeline at all. Next are some examples from a recent project I worked on.

For column name changes, I used the Rename component with a grid variable providing the necessary Source Column and Target Column.

For the various transformations that were necessary I provided a grid variable to the Calculator component. Instead of the complex (and, in my opinion, slightly intimidating) calculator dialog, you can just create a grid variable with two columns, one for the calculation and one for the output column. For calculations, you can use all of the available functions, for example:

  • String manipulations like REPLACE(REPLACE(TRIM("item price in EURO", ' €'), '.', ''), ',', '.') to convert European style currency to a usable number format.
  • Case statements like CASE WHEN "item description > 12 month" = 'Ja' THEN 'yes' WHEN "item description > 12 month" = 'Nein' THEN 'no' ELSE 'no' END to prepare the value to be converted into a proper Boolean format.
  • Date calculations like TRY_TO_DATE("lead date", 'DD.MM.YYYY') to handle different date formats from various sources.

Matillion: Manage Grid Variables

Matillion: Calculations

After these transformations, I added the Convert Type component to convert all manipulated columns into the correct data type. That grid variable needs to have columns for Column, Type, Size and Precision.

In the end, we have three grid variables that will be maintained in the main job and are passed through to whichever job that needs them. I was able to easily document all the changes in an external list just by copying the grid variables into an Excel file. This way the current status is also available outside of METL, which I find very useful when the instance is not running 24/7.

There are of course many more use cases like this. Just keep an eye open for that little “Use Grid Variable” toggle.

Wrap Up

I hope this introductory blog was helpful in demystifying those METL variables. And please stay tuned, as there might be a blog about how variables are handled in the all-new Matillion Data Productivity Cloud coming soon. In the meantime, check out what Fadi Al Rayes has to say about it: Getting SaaSy with Matillion's Data Productivity Cloud.

If you want to learn more or need help with building or maintaining your pipelines, please reach out to us at InterWorks.
