Chaitanya Joshi | InterWorks

Python and Snowflake FAQ


Python is a first-class language in the Snowflake ecosystem, and there are multiple ways to use it with Snowflake:

  • Snowflake Connector for Python: Listed among the Snowflake drivers, this is the standard connector for Snowflake written in Python. It supports developing applications against the Python Database API v2 specification (PEP 249), which includes familiar objects like connections and cursors.
  • Newer Snowflake Python API: Although the Snowflake Connector for Python supports the Python Database API specification and is good for querying and performing DDL/DML operations, it is not a "Pythonic" way to interact with Snowflake, as every interaction must be written in SQL. The newer Snowflake Python API aims to solve that: it is intended to provide comprehensive APIs for interacting with Snowflake resources across data engineering, Snowpark, Snowpark ML and application workloads using a first-class Python API.
  • Snowpark API for Python: Lastly, the Snowpark API aims to solve what the first two APIs cannot: querying and processing data at scale inside Snowflake without moving it. To be fair, Snowpark is available for three languages: Java, Scala and Python.

It is important to understand the strengths of and differences between each of these Python APIs. The conventional Python connector for Snowflake is a minimal library in which one cannot create dedicated Python objects for the corresponding Snowflake objects; everything is communicated through SQL. It is not possible to have an object representing a database, for example. It is all just queries and results.
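To make that concrete, here is a minimal sketch of the query-and-result pattern with the connector; the account, credential, warehouse and database values below are placeholders.

import snowflake.connector

# Placeholder connection parameters -- substitute your own account details
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

try:
    # PEP 249 style: a cursor sends SQL text and returns rows; there are no
    # Python objects representing databases, tasks or other Snowflake constructs
    cur = conn.cursor()
    cur.execute("select current_version()")
    print(cur.fetchone()[0])
finally:
    conn.close()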

But newer Snowflake functionality includes many more exciting features and objects, like tasks, ML models and compute pools. Wouldn't it be great to have an instance of a particular object in the Python code itself and work with it in a standard Pythonic way? That is the gap the newer library fills: it provides native Python objects for the corresponding Snowflake constructs.
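As a hedged illustration of that idea, the sketch below uses the snowflake.core package that ships with the newer Snowflake Python API; the connection details and the database name my_db are placeholders.

import snowflake.connector
from snowflake.core import Root

# Reuse an existing connection (a Snowpark Session can also be passed to Root)
connection = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"
)
root = Root(connection)

# A database is now a Python object rather than a name inside a SQL string
db = root.databases["my_db"].fetch()
print(db.name)

# Collections can be iterated without hand-writing SHOW statements
for schema in root.databases["my_db"].schemas.iter():
    print(schema.name)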

While working with either of the above libraries, it is important to understand that both pull the data out of Snowflake and process it wherever the code is running. In a data-intensive use case, that means a lot of data leaving Snowflake to be processed locally, which may not yield the optimal result. Snowpark is designed to circumvent that: instead of bringing the data to the code, the code is brought to the data.

1. Advantages of Python within Snowflake

When citing advantages of Python within Snowflake, it is safe to assume that the library in question is the Snowpark API. Snowpark itself has two different execution modes. One can write Snowpark code as a local application by importing the Snowpark library into one's own Python application. Alternatively, code can be written directly inside native Snowflake objects like user-defined (tabular) functions (UDFs or UDTFs) or stored procedures (SPs). In the client-side implementation, all the code except the Snowpark operations runs on the client, whereas the Snowpark operations run on compute resources in Snowflake through a session. When Snowpark is used to create a UD(T)F or an SP directly in Snowflake, the entire code is both stored and run in Snowflake.

In both cases, the application benefits from Snowflake's scalability and from reduced data transfer, since the data-related code executes directly in Snowflake.

Moreover, the Python UD(T)Fs and SPs created using Snowpark can be called directly from functions and procedures written in other Snowpark-supported languages, or even from a SQL query. Snowflake automatically handles the interchange of data and types.
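A minimal Snowpark sketch ties these points together: DataFrame operations are pushed down to Snowflake, and a Python UDF is registered that can afterwards be called from SQL. The connection parameters, the orders table with its amount and region columns, and the @my_stage stage are placeholders.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, udf

# Placeholder connection parameters
session = Session.builder.configs({
    "account": "my_account",
    "user": "my_user",
    "password": "my_password",
    "warehouse": "my_wh",
    "database": "my_db",
    "schema": "public",
}).create()

# DataFrame operations are translated to SQL and executed inside Snowflake;
# rows only leave Snowflake when an action such as show() or collect() runs
df = session.table("orders").filter(col("amount") > 100).group_by("region").count()
df.show()

# Register a permanent Python UDF in Snowflake; it can then be called from SQL
# or from other Snowpark-supported languages
@udf(name="double_it", is_permanent=True, stage_location="@my_stage", replace=True)
def double_it(x: int) -> int:
    return x * 2

session.sql("select double_it(21)").show()
session.close()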

Further reading:

  1. A Definitive Guide to Python UDFs in the Snowflake UI
  2. A Definitive Guide to Python Stored Procedures in the Snowflake UI
  3. A Definitive Guide to Snowflake Sessions with Snowpark for Python

2. Can Python be used without Snowpark?

Yes. One can use the traditional or the newer Python libraries for Snowflake to interact with Snowflake without using Snowpark.

3. Difference between Snowpark and Anaconda. Does one have an advantage over the other? Can one be used instead of the other? What are the pros and cons?

Snowpark and Anaconda are not directly comparable. Snowpark is a framework, and one of its implementations is the Snowpark library for Python. Anaconda, on the other hand, is a distribution of Python. A distribution of Python includes the Python interpreter, runtime and system libraries. Along with that, Anaconda is also a Python package repository which hosts additional packages for Python. Unlike the well-known Python Package Index (PyPI), Anaconda has different acceptance criteria for packages to be included in its package repository.

The key way in which Snowpark and Anaconda are related is through Anaconda’s Snowflake Channel. This channel hosts all the third-party packages that are natively available to Snowflake UD(T)Fs and Stored Procedures in a Snowpark Python environment. The channel can also be used with a local Anaconda installation to set up a local Python environment that matches Snowflake’s own internal environment, which is useful when developing or testing processes locally before pushing them up to Snowflake. More information here.

4. What packages are available in Snowpark, and can you turn them off?

To find out what packages are available in Snowpark, check out the Snowflake channel for Anaconda: Snowflake Snowpark for Python (anaconda.com). From the command-line client Snowflake CLI, use the package lookup command.

To turn off certain packages, use packages policies.

5. Difference between using Python on-prem and using Snowpark

The differences are the same as those described above for the standard Snowflake Python libraries: Python running on-prem pulls the data out of Snowflake and processes it wherever the code runs, whereas Snowpark pushes the processing into Snowflake.

6. What controls (security and otherwise) does Snowpark have so that people don’t download malicious packages for Python?

Packages policies can be created and applied account-wide in Snowflake to specify exactly which third-party packages can be used with Snowpark. Besides that, Snowpark Python has inherent security restrictions: connecting to an outside host is not permitted without first creating special integration objects, and writing to permanent file systems outside of Snowflake stages is prohibited (Security practices).

An implementation example can be found in Snowflake External Access: Trigger Matillion DPC Pipelines.

7. Any other IT security concerns with Snowpark?

Snowpark runs on a restricted Python runtime, which is further secured by Snowflake's access control model and base infrastructure. It is much more secure to run Python code in Snowflake than to run it outside of Snowflake.

More Questions?

Reach out to us here if you have more questions, or want to work with us on your Snowflake and Python needs.

Converting Geospatial Coordinate Encoding in Snowflake


Recently, I was involved in a project which required using geospatial (basically maps) data for an analytical application. Now, with ESRI shapefiles being around for more than three decades, any serious BI application like Tableau can easily read and work with these spatial files. But Snowflake has geospatial data types available too, and when there is a need to combine the geospatial data with other data or to process it in a performant manner, Snowflake has the edge over storing and referencing these files locally. Besides, Snowflake has recently made it possible to read files dynamically in a Python Snowpark function. Daria Rostovtseva from Snowflake has written a crisp article about loading shapefile archives using a Python Snowpark UDTF. So, I thought I would make use of it right away and make life (and analysis) simpler. Little did I know that I had embarked on a small adventure.

Loading the Shape Data

As the project required UK geographical data, my colleague Max provided me with some excellent publicly available datasets from the 2011 UK census. Following the above article, it is fairly straightforward to load the data. The entire shapefile ZIP archive is required to decipher the data, not just the .shp file. Simply create an internal stage, load the shape data archive there and then use the Snowpark Python table function py_load_geofile given in the blog to load the data into a table.

create stage int_stage;
put file://infuse_dist_lyr_2011.zip @int_stage/shape_files parallel = 8 auto_compress = false source_compression = auto_detect;

create or replace table local_area_districts_staged as
select 
    properties,
    to_geography(wkb, True) as geo_features
from table(
    py_load_geofile(
        build_scoped_file_url(@int_stage, 'shape_files/infuse_dist_lyr_2011.zip')
                                        , 'infuse_dist_lyr_2011.shp'
    )
);

It loaded uneventfully; but when referenced in Tableau, Tableau didn't like it, although the data types were what they should have been.

Loading Shape Data into Snowflake

The first suspect for a data engineer is always the data. Snowflake provides a neat function, st_isvalid, to check whether the underlying geographical data is valid or not.

select st_isvalid(geo_features)
from local_area_districts_staged;

To my shock, every value in the geography column was invalid!

Geography column invalid

But in this case, the data couldn't be doubted, as it was officially published government data. It left me thinking, and when I started looking for clues, I came across this statement in Daria's blog, which I had overlooked before:

It’s important to be aware of the shapefile’s spatial reference. If geographic coordinates are unprojected (i.e., representing locations on the Earth’s surface as points on a 3D sphere using WGS84 geographic coordinate system), it is appropriate to parse the geographic coordinates into Snowflake’s geography data type. In my example, the data is in WGS84, and parsed using to_geography function. If the coordinates are projected (representing locations on the Earth’s surface using a two-dimensional Cartesian coordinate system), they should be parsed into geometry data type using one of the geometry functions. To figure out if the shapefile is projected or unprojected, you can look at the .prj file inside the zipped shapefile archive.

The CTAS statement to import the data was thus updated to load a column of type geometry instead of geography as follows.

create or replace table local_area_districts_staged as
select 
    properties,
    to_geometry(wkb, False) as geo_features
from table(
    py_load_geofile(
        build_scoped_file_url(@int_stage, 'shape_files/infuse_dist_lyr_2011.zip')
                                        , 'infuse_dist_lyr_2011.shp'
    )
);

Please note the call to the to_geometry function. The value of the second parameter, <allow_invalid>, has now been set to False to perform stricter loading and to reject invalid shapes. Keen readers will notice that in the first version of the CTAS statement (see above), this parameter value to the to_geography function was True, which is why Snowflake accepted the invalid geographical coordinates in the first place. However, this was far from a solution, as Tableau would still require the geographical coordinates in longitude/latitude format.

Finding and Converting the Coordinate Encoding

My knowledge of geospatial data and systems being less than elementary, this was a real puzzle for me. But a close look at some data points (for example, (291902, 196289)) told me that these are not longitude/latitude values! WGS84 meant longitude/latitude data. Now a word (or rather a paragraph!) on geographical coordinate systems is necessary here. WGS, or World Geodetic System, is a "standard used in cartography, geodesy and satellite navigation including GPS," according to Wikipedia, and 84 is the current version of the standard; so, in layman's terms, the longitude/latitude numbers become WGS84-encoded coordinates. The .prj file in the data archive had some really cryptic information, out of which the following seemed interesting: "OSGB_1936_British_National_Grid".
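As a quick local sanity check (outside Snowflake), pyproj, which features later in this post anyway, can decode that .prj content and confirm whether the coordinate system is projected. A minimal sketch, assuming the .prj member inside the archive shares the shapefile's base name:

import zipfile
from pyproj import CRS

# Read the .prj entry straight out of the shapefile archive
with zipfile.ZipFile("infuse_dist_lyr_2011.zip") as zf:
    prj_text = zf.read("infuse_dist_lyr_2011.prj").decode("utf-8")

crs = CRS(prj_text)
print(crs.name)          # the projection name declared in the .prj
print(crs.is_projected)  # True -> easting/northing, so GEOMETRY rather than GEOGRAPHY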

Using a Built-in Snowflake Function

Fortunately, Snowflake has recently released the st_transform function into GA. With this function, it is easy to convert the spatial reference system of any coordinates into any other system. When the "from" and "to" SRIDs of the coordinate systems are known, the conversion is simply a function call on the column that contains the geospatial data.

ST_TRANSFORM( <geometry_expression> [ , <from_srid> ] , <to_srid> );

A simple search showed that the SRIDs for OSGB_1936_British_National_Grid and WGS84 are 27700 and 4326 respectively (WGS84 has many associated coordinate systems, but the relevant one here is 4326, as per the Snowflake documentation). That means the conversion could simply be:

select st_transform(geo_features, 27700, 4326) from local_area_districts_staged;

But there is one thing to keep in mind: the st_transform function returns the GEOMETRY data type, not GEOGRAPHY. So the final table creation statement became:

create or replace table local_area_districts
as
select properties, try_to_geography(st_aswkb(st_transform(geo_features, 27700, 4326))) as geography
from local_area_districts_staged;

A quick test showed that Tableau was happy with this data and could read it as geospatial map data (a screenshot similar to the one at the end of this post).

Now, this could very well be the end of this post, as the goal was achieved, but something was still on my mind: the .prj file that came with the data, which I mentioned earlier, had much more descriptive content. The following, to be precise:

PROJCS[
  "OSGB_1936_British_National_Grid",
  GEOGCS["GCS_OSGB 1936",
    DATUM[
      "D_OSGB_1936",
      SPHEROID["Airy_1830",6377563.396,299.3249646]
    ],
    PRIMEM["Greenwich",0],
    UNIT["Degree",0.017453292519943295]
  ],
  PROJECTION["Transverse_Mercator"],
  PARAMETER["latitude_of_origin",49],
  PARAMETER["central_meridian",-2],
  PARAMETER["scale_factor",0.9996012717],
  PARAMETER["false_easting",400000],
  PARAMETER["false_northing",-100000],
  UNIT["Meter",1]
]

There was a lot of information there that the st_transform function had no regard for. In all fairness, it did serve my purpose, but what if all this extra information is required to achieve projection or conversion accuracy? How could it be used, given there is no way to pass it to the st_transform function? That is when it clicked: Python has a solution for everything! If I could find a Pythonic way of converting the coordinates, could Snowpark help me?

Using Snowpark Python UDF to Convert Coordinates

pyproj Transformer

The brainwave I had was not going to make me a geoinformatics expert in one day, but thankfully, these days we have a very powerful assistant available. I sought the help of ChatGPT.

ChatGPT Python Help

The biggest help from ChatGPT was determining which Python package to use and the boilerplate code. Credit should be given where it is due, and although it was a great help in setting me on the right course, it was by no means a simple copy-and-paste job. A lot more needed to be done to upgrade this code into "production"-grade code. It's also worth noting that the British Geological Survey provides a web service as well as a detailed SQL process for bulk conversion of coordinates; both were unsuitable for this use case due to factors like dependencies and scale. A native Python Snowpark UDF was a better choice, and the crux of this new solution therefore remained the conversion function convert_bng_to_latlon provided (and so beautifully drafted) by ChatGPT.

The transform function suggested by ChatGPT has been deprecated since version 2.6.1 of the pyproj library. The newest version (3.6.0 as of writing this article) has a class named Transformer, which is recommended for all these conversion tasks. There are multiple ways to create a Transformer object: one can simply specify the EPSG strings, or create a pyproj.crs.CRS object from elaborate settings like those given above.

That means the following simple code is sufficient and would work:

from pyproj import Transformer

transformer = Transformer.from_crs(
  "EPSG:27700" # British National Grid (OSGB)
  , "EPSG:4326" # WGS84 (latitude and longitude)
)

This next way of creating a Transformer object is a much better way to preserve all the characteristics of the source CRS:

from pyproj import Transformer, CRS

crs_str = '''
PROJCS[
  "OSGB_1936_British_National_Grid"
  ,GEOGCS[
    "GCS_OSGB 1936"
    ,DATUM[
      "D_OSGB_1936"
      ,SPHEROID["Airy_1830",6377563.396,299.3249646]
    ]
    ,PRIMEM["Greenwich",0]
    ,UNIT["Degree",0.017453292519943295]
  ]
  ,PROJECTION["Transverse_Mercator"]
  ,PARAMETER["latitude_of_origin",49]
  ,PARAMETER["central_meridian",-2]
  ,PARAMETER["scale_factor",0.9996012717]
  ,PARAMETER["false_easting",400000]
  ,PARAMETER["false_northing",-100000]
  ,UNIT["Meter",1]
]
'''

# Create an explicit CRS object
in_crs = CRS(crs_str)
# Define the coordinate systems
transformer = Transformer.from_crs(
  crs_from=in_crs
  , crs_to='EPSG:4326'
  , always_xy=True
)

The from_crs function accepts a plethora of other parameters, which are explained well in the pyproj Transformer documentation.

Snowflake requires the geographic data in longitude/latitude order, not the latitude/longitude order that the Transformer.transform function would normally return. The always_xy parameter takes care of swapping the result values into x, y (longitude, latitude) order.
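A quick standalone check of that axis order, using the sample easting/northing pair quoted earlier in this post (the printed values are approximate):

from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)

# Input is (easting, northing); with always_xy=True the output comes back
# as (longitude, latitude), which is the order Snowflake expects
lon, lat = transformer.transform(291902, 196289)
print(lon, lat)  # roughly -3.6 51.7, a point in South Wales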

Maintaining the Structure of the Shape Data

Geospatial objects in any format (WKT, WKB, GeoJSON, etc.) are collections of coordinate values for various shapes. The nature of the shape determines how many coordinates are required, and further rules can apply. For example, a Point requires only one coordinate pair, whereas a LineString requires two (or more) coordinate pairs. Shapes like Polygon and MultiPolygon not only require several pairs of coordinates; they also require the first and last coordinates to match so the shape closes nicely, and preferably no overlaps in the edges that the resulting shape forms.

The data in question contained mostly Polygons and MultiPolygons, which are represented as lists or nested lists of tuples. For MultiPolygons, there is one more level of list nesting. This structure must be kept intact while converting the values in order to maintain the validity of the shapes. In the examples below, notice the extra level of nesting for the MultiPolygon type. See how much the built-in st_transform function does for us!

{
  "coordinates": [
    [
      [
        5.319848470000000e+05,
        1.818774940000000e+05
      ],
      [
        5.319886580000001e+05,
        1.818791160000000e+05
      ],
      ...
   ]
  ],
  "type": "Polygon"
}

{
  "coordinates": [
    [
      [
        [
          2.402629060000000e+05,
          1.271040940000000e+05
        ],
        [
          2.402652040000000e+05,
          1.271073980000000e+05
        ],
        ...
      ]
    ]
  ],
  "type": "MultiPolygon"
}

The next thing to consider was which data types a Python UDF can accept. As per the SQL-Python data type mapping given in the Snowflake documentation, a Python UDF can natively accept the GEOGRAPHY data type as a parameter, but not GEOMETRY. That meant the geo_features column of the local_area_districts_staged table had to be converted to GeoJSON before being sent to the Python UDF.

Python UDF for Geospatial Coordinate Conversion

Considering all the above points, the code for the final conversion Python UDF looked as follows:

create or replace function py_convert_to_wgs84(FROM_CRS string, GEOJSON object)
returns object
language python
runtime_version = '3.10'
packages = ('pyproj')
handler = 'convert_coordinates_to_lonlat'
as
$$
from pyproj import Transformer, CRS

def convert_coordinates_to_lonlat(FROM_CRS, GEOJSON):
  # Define the input CRS
  in_crs = CRS(FROM_CRS)
  
  # Output GeoJSON object
  out_geojson = {}  
  out_geojson['type'] = GEOJSON['type']
  # Define the coordinate systems
  transformer = Transformer.from_crs(
    crs_from = in_crs # Input CRS provided as parameter
    , crs_to = "EPSG:4326" # WGS84 (latitude and longitude)
    , always_xy = True # If true, the transform method will accept as input and return as output coordinates using the traditional GIS order, that is longitude, latitude for geographic CRS and easting, northing for most projected CRS.
  )
  
  # (Multi)List to store the converted coordinates
  wgs84_coordinates = []
  for coord_list in GEOJSON['coordinates']:
    if GEOJSON['type'] == 'Polygon': # For Polygon type, coord_list is a ring: a direct list of coordinate tuples
      # Convert each BNG coordinate to longitude and latitude
      lon_lat_coords = [transformer.transform(xycoord[0], xycoord[1]) for xycoord in coord_list]
      wgs84_coordinates.append(lon_lat_coords)
    else: # For MultiPolygon type, coord_list is a polygon: a list of rings (lists of coordinates)
      new_coord_list = []
      for coordinates in coord_list:
        # Convert each BNG coordinate to longitude and latitude
        lon_lat_coords = [transformer.transform(xycoord[0], xycoord[1]) for xycoord in coordinates]
        new_coord_list.append(lon_lat_coords)
      wgs84_coordinates.append(new_coord_list)

  out_geojson['coordinates'] = wgs84_coordinates

  return out_geojson
$$;

Notice how the handling for Polygon and MultiPolygon types differs. As the call to this UDF requires the geo_features column to be in a data type the UDF can accept, it is first converted to GeoJSON and then to an OBJECT. The statement also uses try_to_geography instead of the to_geography system function to avoid failures caused by the coordinate conversion. The SQL statement for that is as follows.

create or replace table local_area_districts
as
select 
  properties
  , try_to_geography(
      py_convert_to_wgs84(
        'PROJCS["OSGB_1936_British_National_Grid",GEOGCS["GCS_OSGB 1936",DATUM["D_OSGB_1936",SPHEROID["Airy_1830",6377563.396,299.3249646]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",49],PARAMETER["central_meridian",-2],PARAMETER["scale_factor",0.9996012717],PARAMETER["false_easting",400000],PARAMETER["false_northing",-100000],UNIT["Meter",1]]'
        , to_object(st_asgeojson(geo_features)))
      ) as geography
from local_area_districts_staged
;

The benefit of accepting the CRS settings as a string is that one can provide a detailed, even custom, specification, which is not possible with the st_transform function. But at the same time, a simple EPSG code would also suffice. That means, in my case, the following code would have achieved the same result:

create or replace table local_area_districts
as
select 
  properties
  , try_to_geography(
      py_convert_to_wgs84(
        'EPSG:27700'
        , to_object(st_asgeojson(geo_features)))
      ) as geography
from local_area_districts_staged
;

This flexibility is the power of using Python and Snowpark!

Not only did Tableau recognize the shape (GEOGRAPHY) data created by either of the methods above, it was also able to plot it flawlessly, finally making me and my colleague Max happy.

Geospatial Data Visualized in Tableau

Concluding Remarks

Snowflake is a powerful platform. Out of the box, it provides a way to convert coordinates from one geospatial reference system to another. But if you want to go further than that and support custom specifications or perform some extra steps during or after conversion, you have the entire power of Python at your disposal through Snowpark. The choice is yours!

Check out the blog of my colleague, Max Giegerich, who plays with this data further to show some interesting possibilities.

All Data, All Workloads: The Future of Data Processing with Snowflake

Disclaimer: The author is solely responsible for views and opinions in this article.

Snowflake Summit 2022 was a success. Enough has already been written about what came out of it and the exciting new announcements, so I will not be repeating any of that. Instead, this is just my musing about how I perceive the future of data processing on Snowflake.

In the summit keynote, Benoit Dageville established the seven pillars of the Data Cloud foundation, and two of them kept ringing in my mind … all data, all workloads.

Fuller Data Lakes Support

Snowflake already natively supported semi-structured data like JSON and XML; in fact, it has been one of Snowflake's USPs for quite some time. The launch of support for unstructured data like images and blobs in September 2021 was exciting but still limited in functionality: only referencing and sharing links to the unstructured data was possible. Such data wasn't particularly "processable" directly in Snowflake.

Similar was the case with external tables. Support for external tables goes back to 2019. Various file formats, including popular ones like Apache Parquet, are supported, but external tables are inherently limited in their functionality: one can only read from these tables, and DML operations are not supported. Besides, in the case of normal external tables, the table metadata is stored in Snowflake, thereby fragmenting the view of the data; the governance and insight gained on the external tables stay locked in Snowflake. But Snowflake's commitment to data lake workloads is serious, and that first showed in the February 2022 announcement of support for the Delta Lake table format[1], topped off with the announcement of support for the Apache Iceberg format[2]. I will not go into the battle of Delta Lake vs. Iceberg, but supporting two out of three important open storage formats (the third being Apache Hudi) for data lake workloads opens some big doors. The possibilities are countless: proper ACID transactions, DML operations, schema evolution and the ability to time travel, all while keeping the data and metadata entirely external to Snowflake. Vendor neutrality will no longer be only a promise, and data mesh will no longer be only an enterprise architectural pattern.

Native Unstructured Data Processing and Enhanced Data Streaming

The excitement doesn't stop there. With Snowpark support for Java and now Python, one can not only store unstructured data but also process it natively inside Snowflake. I can already imagine machine learning workflows built on the data lake and processed purely using Snowflake virtual warehouses. And there is no need to worry about productionizing your data models, as you can deploy them on any cloud gateway of your choice. You can then use external functions to call the results and predictions of your models right from within your SQL queries. This means your machine learning applications require only one familiar backend and processing engine: Snowflake.

A discussion of "all data" cannot be complete without streaming data, and Snowflake is aware of it. Streaming capabilities in Snowflake are even better now: the improved serverless streaming framework reduces latency by a factor of 10 and makes the data readily available not only to Snowpipe but also to connectors built on top of it, like the Snowflake Kafka connector.

Support for All Data Workloads with Unistore

Now, we've really talked about "all data," and we've talked about almost all workloads. Why almost all? Is it not enough to support data analytics, data science, data lake and data sharing as well as the marketplace, all on a single platform? Well, we all know the core workload of any data platform: online transactional workloads, data backends supporting hundreds of thousands of simultaneous transactions every second, transactions with write (DML) operations and not just queries. The holy grail of data platforms has always been to support all these workloads together as one system: one homogeneous system with no patchwork of modules and a single set of operational semantics.

Until now, no system has been able to support transactional, analytical and data lake workloads, let alone machine learning (which is, anyway, late to the party). There have been numerous combinations (for example, OLTP and OLAP together, or analytical and data lake together), but no platform has tried to solve all three use cases by bringing all the facets of data and workloads together, and definitely no cloud-based data platform has tried either. Until now.

And this was the pinnacle of announcements at Snowflake Summit 2022. On 14 June 2022, Snowflake announced Unistore, a hybrid workload offering that will support both transactional and analytical workloads together. Snowflake is taking the daring approach of a row-based storage architecture, intelligently mated with its traditional columnar store, to implement Unistore. All of this with no strings attached! You won't have to make a copy of your table to support online transactions while using the same table as the backend for your analytical dashboard. You won't have to keep converting data from 3NF to star-schema formats. And you won't have to archive your historical data into some other table just to keep the performance of your transactional data app palatable. Have you missed enforced constraints (primary and foreign keys) in Snowflake before? Well, no longer! Combine that with the SQL API and the shiny new native application framework, and you have something very powerful. If that's not "all workloads," then what is?

More Than a “Cloud Data Warehouse”

Even after many advanced capabilities like data sharing, governance and the recent data lake advancements, some people still refer to Snowflake as a "cloud data warehouse." That is unfair, because the words "data cloud" are no longer mere jargon; they have become a reality: a place where you can do anything and everything with your data, right from gathering it to monetizing it.

The future of the “Data Cloud” is bright, very bright.


[1] Some would argue that the support for the Hive metastore was, in fact, the first step towards wider adoption of data lake architecture, but in the age of cloud data platforms Hive serves a very limited, legacy purpose.

[2] They announced it during the summit, but private previews were started long before that.

Connect to Unix/Linux over SSH via Matillion Bash Component


Matillion is a versatile ELT tool. It has many connectors to numerous data sources already at our disposal, but sometimes a situation arises when your use case is a little different from plain data extraction: for example, extracting a list of files from a remote Linux/Unix host or executing something on one. There is nothing more pervasive than SSH for such purposes, and where would you run your script or command if not in the mighty Matillion Bash Script component!

In this post, you will learn how to use the Matillion Bash component to connect and run commands on another host over SSH.

Disclaimers

The procedure in this post was tested on a Matillion instance hosted in the AWS Cloud. The target host was in the same AWS account, in fact in the same VPC and subnet, and was an Ubuntu virtual machine. However, as SSH is ubiquitous, there is no reason why this procedure would not work with any host, including Windows, as long as certain prerequisites like open access to the SSH port and a running SSH daemon are fulfilled.

Another item to note is that this exercise was done using a private key for the SSH, which is generally more secure than password authentication. Besides, by using a key file, you don’t have to store the password in a Matillion password manager or enable privileged user access on a Bash Script component.

Also, this post assumes knowledge of and familiarity with the command line, Linux/Unix tools and Matillion.

Prerequisites

  • SSH access to the Matillion VM with privileged user (root/centos) credentials (How Matillion SSH access can be used has been explained in detail in this blog post by my colleague Chris Hastie)
  • The private key for the user of the target host. If the target host is an AWS EC2 instance, AWS itself creates the key.
  • The VM must have a public IP or Elastic IP on AWS (even in the same subnet and VPC, the communication does not work within the private IP addresses as AWS has it clamped down)
  • Availability of remote connectivity and data/file transfer tools like SSH/Putty and SCP/WinSCP, etc., on your local working machine
  • Additional setting for AWS: If the target host is an EC2 VM, and if there is a security group controlling the traffic to the SSH port (22), the Elastic/public IP of Matillion VM must be included in the inbound traffic rules. This will ensure the Matillion VM can reach the SSH port of the target host.

The Step-by-Step Process

The actual procedure is quite simple once you have (or have checked) everything required. Simply start by connecting to your Matillion instance with the ssh utility. The default user for connecting to the Matillion instance is centos, unless it was changed during VM setup.

Once logged in, create a folder on the Matillion VM to store the key files. While running commands on the Matillion VM as the centos user, the sudo qualifier is required to force access elevation:

sudo mkdir /ssh_keys

Using SFTP, transfer the key file (.pem file) for the remote host to the Matillion instance. On Windows, a utility like WinSCP can also be used. Move the file inside the folder.

As Matillion ELT processes are owned by the tomcat user in the VM, it is also a good idea to transfer the ownership of the folder to the tomcat user. Change the ownership of the folder recursively (i.e. also for the constituent files) to tomcat:tomcat (i.e. both user and group tomcat):

sudo chown tomcat:tomcat -R /ssh_keys/

Both the keys folder and the keys inside need to be locked down. Otherwise, using the key to connect to the remote host would yield a permissions error.

Thus, change the mode of the directory to 755 and change the mode of the key file to 600:

sudo chmod 755 /ssh_keys/
sudo chmod 600 /ssh_keys/key-file-to-target-server.pem

Chris also discusses best practices and shares more details about storing keys in another blog post.

So, you have your key to the host and you have primed it up. It’s not difficult to predict what happens next. The ssh command, which is used in general to connect to a remote host, will also be used from within the Matillion Bash component. But there is a pitfall to avoid first. The next step is very important.

Before trying to connect directly from the Bash component, change to the tomcat user:

sudo su tomcat

Connect once from the command line using the key file:

cd /ssh_keys
ssh -i key-file-to-target-server.pem user@host

The host key fingerprint confirmation prompt is shown. After confirmation, the host key (i.e. the fingerprint of the remote host we're trying to connect to) will be permanently added to the tomcat user's known_hosts file.

If the step of connecting once from the console is skipped, the Bash component errors out with a permission error, even if everything has been set up correctly. The reason is simply that the remote host entry is not present in the known_hosts file of the tomcat user. The Matillion UI cannot automatically add the host, and the Bash component does not have the ability to report it in the UI. That’s why it fails. Those feeling adventurous can try it out.

Now, the connection from the Bash Script component will be successful. An example of the Bash component setup is shown below; the user and host details of the target server can be stored in job variables:

An example job using the above technique could look like this:

Here, a file list is obtained from a remote host and then files are staged to S3 using the Data Transfer component.

I hope you found this helpful! Let us know if we can help you troubleshoot or sort through other areas of your data management and strategy.
