How to Maximize Data Governance in Snowflake Test Environments

Do you create and manage separate test environments to secure customer data? Save time and costs with Jade's Snowflake data masking solution, designed to enhance data governance and compliance in cloud test environments.

With the increasing need to comply with regulations and standards such as GDPR, HIPAA, and PCI DSS, it has become crucial for organizations to protect sensitive data. In software development, data governance becomes a challenge in non-production environments.

These non-production environments, such as testing, QA, and staging, are crucial for software development. However, the data in them is accessed by many stakeholders, which creates security and compliance risks.

To help businesses maximize data governance in Snowflake test environments, Jade, a Snowflake Select Partner, has built a unique Data Masking solution.

This blog delves into the concept of data masking, its significance, and why Jade built a data masking solution for Snowflake test environments. It also covers the detailed architecture of the solution.

What is Data Masking?

Data masking is the process of replacing sensitive data with fictitious or scrambled values that preserve the characteristics of the original data while protecting its confidentiality. The goal of data masking is to prevent unauthorized access to sensitive data in non-production environments.
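As a rough illustration of scrambling values while preserving the characteristics of the original data, the Python sketch below swaps each digit for another digit and each letter for another letter, so length, letter case, and punctuation survive. The function name, the seed parameter, and the SHA-256-based seeding are assumptions made for this example; they are not taken from Jade's solution.

```python
import hashlib
import random
import string


def scramble_preserving_format(value: str, seed: str = "demo-seed") -> str:
    """Return a scrambled copy of value with the same length and layout.

    Digits become digits, letters become letters of the same case, and
    every other character (dashes, spaces, '@') is left untouched.
    """
    # Derive the random seed from the input so that the same input always
    # yields the same masked output (useful for repeatable test data).
    digest = hashlib.sha256((seed + value).encode("utf-8")).hexdigest()
    rng = random.Random(digest)

    scrambled = []
    for ch in value:
        if ch.isdigit():
            scrambled.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            scrambled.append(rng.choice(pool))
        else:
            scrambled.append(ch)
    return "".join(scrambled)


if __name__ == "__main__":
    # The output keeps the 4-4-4-4 card layout but not the real digits.
    print(scramble_preserving_format("4111-1111-1111-1111"))
```

A masked card number produced this way still passes length and layout checks in a test suite, which is the practical meaning of "preserving the characteristics" of the data.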

Why is Data Masking Important in Non-Production Environments?

There are several reasons why data masking is necessary in non-production environments.

Compliance

Organizations are required to comply with various regulations and standards that mandate the protection of sensitive data. Data masking helps organizations comply with these regulations and standards by preventing unauthorized access to sensitive data in non-production environments.

Security

Non-production environments are more vulnerable to security breaches and attacks, which can compromise sensitive data. Data masking helps protect sensitive data by replacing it with fictitious data that cannot be used for malicious purposes.

Testing

Development and testing of software applications require a lot of data, including sensitive data such as customer information and credit card details. Data masking allows developers and testers to work with realistic data without compromising the privacy of individuals.

Cost-effectiveness

Data masking is a cost-effective solution for protecting sensitive data in non-production environments. It is less expensive than creating and managing separate environments for testing and development.

Data Masking Solution for Snowflake Test Environments Built by Jade

Jade has developed a Snowflake data masking solution that enables businesses to enhance their data governance in test environments within Snowflake while ensuring compliance with various regulations and standards that require the safeguarding of sensitive data.

Jade's Snowflake data masking solution offers an automated process that takes data from the source system, conducts PII discovery, performs a lookup, deploys masking policies, and loads the data. By automating the data masking process, businesses can improve security, comply with regulations, save costs, and focus on real development needs.

The architecture of the data masking solution is based on Python and Snowflake and covers three use cases.

Jade's Data Masking Solution - The Architecture

The architecture of Jade’s Snowflake data masking solution is based on Python and Snowflake. It uses a metadata table to store information for dynamic data masking. The solution covers three use cases:

Use-case 1: Loading masked data to non-production environments
Use-case 2: Protecting PII from unauthorized users
Use-case 3: Protecting PII data while providing it to a third party through a Snowflake reader account

The architecture functions in the following way.

The first step involves bringing in data from the source and putting it in the production Snowflake staging area.
A PII discovery phase can be conducted to identify probable PII fields, but it's not mandatory.
Once the PII fields are identified and the metadata table is updated, a Python script takes three inputs: application name, table name, and masking type (that is, use case 1, 2, or 3).
Using those three inputs, the script looks up the metadata table, finds the masking policy to apply to each PII field of the given table, and applies those policies.
The data is then loaded to the masked environment or the production environment, depending on the use case.
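The lookup step described above can be sketched in Python as follows. This is a minimal illustration, not Jade's script: the metadata rows, the column and policy names, the `MaskingRule` type, and the mapping from masking type to target environment are all assumptions. The mapping reflects one reading of the use cases, in which only use case 1 writes to a separate masked environment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MaskingRule:
    """One row of a hypothetical metadata table."""
    application: str
    table: str
    column: str
    policy: str


# Hypothetical metadata rows; a real table would live in Snowflake.
METADATA: List[MaskingRule] = [
    MaskingRule("crm", "customers", "full_name", "mask_name"),
    MaskingRule("crm", "customers", "date_of_birth", "mask_dob"),
    MaskingRule("billing", "cards", "card_number", "mask_card"),
]

# Assumed mapping from masking type (use case number) to the target.
TARGET_BY_MASKING_TYPE: Dict[int, str] = {
    1: "masked environment",
    2: "production environment",
    3: "production environment",
}


def lookup_rules(metadata: List[MaskingRule], application: str, table: str) -> List[MaskingRule]:
    """Return the metadata rows that apply to one application table."""
    return [r for r in metadata if r.application == application and r.table == table]


def build_plan(metadata: List[MaskingRule], application: str, table: str, masking_type: int) -> dict:
    """Combine the three script inputs into a per-column masking plan."""
    if masking_type not in TARGET_BY_MASKING_TYPE:
        raise ValueError(f"unknown masking type: {masking_type}")
    rules = lookup_rules(metadata, application, table)
    if not rules:
        raise LookupError(f"no masking metadata for {application}.{table}")
    return {
        "target": TARGET_BY_MASKING_TYPE[masking_type],
        "policies": {r.column: r.policy for r in rules},
    }


if __name__ == "__main__":
    print(build_plan(METADATA, "crm", "customers", 1))
```

Keeping the column-to-policy mapping in data rather than in code is what lets the same script serve every table.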

Scalable Data Masking Solution for Snowflake: Individual Tables and Batch Mode

Jade's data masking solution works on individual tables and in batch mode, so it can cover your entire database schema.

A prebuilt wrapper program takes all tables as input and performs the metadata lookup and masking for each of them. Because it works through the tables in batches, a full run may take some time.
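A wrapper of this kind might look like the Python sketch below. It is only an assumed shape: the function names, the batch size, and the choice to record a failure and continue with the next table are illustrative decisions, and `mask_table` stands in for whatever per-table masking routine is actually used.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def in_batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def mask_all_tables(
    tables: Iterable[str],
    mask_table: Callable[[str], None],
    batch_size: int = 2,
) -> Dict[str, str]:
    """Run mask_table on every table, batch by batch, and report the outcome."""
    results: Dict[str, str] = {}
    for batch in in_batches(tables, batch_size):
        for table in batch:
            try:
                mask_table(table)
                results[table] = "masked"
            except Exception as exc:  # keep going so one bad table does not stop the run
                results[table] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    def fake_mask_table(table: str) -> None:
        # Placeholder for the real per-table masking call.
        if table == "orders":
            raise RuntimeError("no metadata")

    print(mask_all_tables(["customers", "orders", "cards"], fake_mask_table))
```

Returning a per-table status makes it easy for a scheduler to flag partial failures after a long batch run.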

Additionally, if you want to schedule it, you can do so once a week, daily, or monthly using your existing scheduler or as a Windows service. It can also be automated with your refresh policies so that whenever you have a scheduled or manual refresh of your non-production environments from production, these scripts can be deployed to mask the data automatically.

Technical Overview: Data Masking Process for Protecting PII in Three Use Cases

The first step in the data masking process is the PII exploration phase. This involves running a predictive algorithm on the table data to determine the probability of each field being PII and, if so, what category it falls under. The results are stored in a table, with a JSON data file generated for each application table.

To view the data in a tabular format, a query can be run which displays the probability of each field being PII. Fields with a probability of 100% are considered PII, while those with a value of 0% are non-PII. Fields with a probability greater than 0% are assigned a semantic category, such as address or date of birth.
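The interpretation of the discovery results can be sketched in Python. The JSON layout below (one entry per field with `probability` and `semantic_category` keys) is an assumed format for illustration, since the actual output format is not described here. Labelling fields that fall strictly between 0% and 100% as "review" is also an assumption about how such fields would be handled.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical discovery output for one application table.
SAMPLE_RESULT = json.loads("""
{
  "FULL_NAME":   {"probability": 1.0, "semantic_category": "NAME"},
  "CITY":        {"probability": 0.6, "semantic_category": "ADDRESS"},
  "ORDER_TOTAL": {"probability": 0.0, "semantic_category": null}
}
""")

Row = Tuple[str, float, str, Optional[str]]


def classify_fields(result: dict) -> List[Row]:
    """Turn the JSON result into rows of (field, probability, status, category)."""
    rows: List[Row] = []
    for field, info in result.items():
        probability = float(info["probability"])
        if probability >= 1.0:
            status = "PII"
        elif probability <= 0.0:
            status = "non-PII"
        else:
            status = "review"  # assumed label for in-between probabilities
        # Only fields with a probability above zero carry a semantic category.
        category = info.get("semantic_category") if probability > 0.0 else None
        rows.append((field, probability, status, category))
    # Most likely PII first, mirroring a tabular view sorted by probability.
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for row in classify_fields(SAMPLE_RESULT):
        print(row)
```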

Based on compliance and organizational needs, specific fields can be selected for masking. Here is a technical overview of how the solution supports the following three use cases:

Use-case 1: Loading masked data to non-production environments

A Python program takes the table name, application name, and masking type as input
It looks up the metadata table, applies the masking rule on the fly, and loads the final table to the masked environment
The masking policy is selected automatically
Masked data is loaded into the masked environment
The masked data table reflects masked data, except for the ID field (for comparison purposes)
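The effect of use case 1 on the data itself can be shown with a small in-memory Python sketch. The rows, the two masking functions, and the policy names are invented for the example; the only behaviour taken from the description above is that every selected field is masked while the ID field is left intact so that masked rows can be compared with their source rows.

```python
from typing import Callable, Dict, List


def redact(value: object) -> str:
    """Replace every character with an asterisk, keeping the length."""
    return "*" * len(str(value))


def keep_year_only(value: object) -> str:
    """Reduce an ISO date string (YYYY-MM-DD) to the first day of its year."""
    return str(value)[:4] + "-01-01"


# Hypothetical policy names mapped to masking functions.
MASKERS: Dict[str, Callable[[object], str]] = {
    "mask_name": redact,
    "mask_dob": keep_year_only,
}


def mask_rows(rows: List[dict], policies: Dict[str, str], id_column: str = "id") -> List[dict]:
    """Return masked copies of rows; the ID column is never masked."""
    masked_rows = []
    for row in rows:
        masked = dict(row)  # copy, so the source rows stay unchanged
        for column, policy in policies.items():
            if column == id_column:
                continue
            if masked.get(column) is not None:
                masked[column] = MASKERS[policy](masked[column])
        masked_rows.append(masked)
    return masked_rows


if __name__ == "__main__":
    source = [{"id": 7, "full_name": "Ada Lovelace", "date_of_birth": "1815-12-10"}]
    policies = {"full_name": "mask_name", "date_of_birth": "mask_dob"}
    print(mask_rows(source, policies))
```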

Use-case 2: Protecting PII from unauthorized users

Only one instance of the production table data remains in storage; no additional table is maintained for masked data. Authorized users see the actual PII values, while unauthorized users see masked values in the PII fields.
Python code with specific parameters deploys masking policies on PII fields in the table
Masking policies are deployed on fields like name, date of birth, etc., to protect sensitive information
Customization is available to hide certain portions of data for specific users
Data protection is achieved for test users accessing the testing table
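One way Python code could deploy such policies is by generating Snowflake's dynamic data masking statements, `CREATE MASKING POLICY` and `ALTER TABLE ... MODIFY COLUMN ... SET MASKING POLICY`. The sketch below only builds the SQL text. Whether Jade's script emits these exact statements is not stated in this blog, and the helper names, role names, and masked expressions are assumptions. For simplicity it accepts unqualified identifiers only.

```python
import re
from typing import Iterable

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def _check_identifier(name: str) -> str:
    """Reject anything that is not a plain, unquoted, unqualified identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def masking_policy_sql(
    policy: str,
    allowed_roles: Iterable[str],
    masked_expr: str = "'***MASKED***'",
) -> str:
    """Build a policy that shows real values only to the allowed roles.

    masked_expr is trusted SQL written by the policy author, for example
    "CONCAT('***-**-', RIGHT(val, 4))" to reveal only the last four characters.
    """
    roles = [_check_identifier(role) for role in allowed_roles]
    if not roles:
        raise ValueError("at least one allowed role is required")
    role_list = ", ".join(f"'{role}'" for role in roles)
    return (
        f"CREATE OR REPLACE MASKING POLICY {_check_identifier(policy)} "
        "AS (val STRING) RETURNS STRING ->\n"
        f"  CASE WHEN CURRENT_ROLE() IN ({role_list}) THEN val ELSE {masked_expr} END"
    )


def attach_policy_sql(table: str, column: str, policy: str) -> str:
    """Build the statement that attaches a policy to one column."""
    return (
        f"ALTER TABLE {_check_identifier(table)} "
        f"MODIFY COLUMN {_check_identifier(column)} "
        f"SET MASKING POLICY {_check_identifier(policy)}"
    )


if __name__ == "__main__":
    print(masking_policy_sql("mask_name", ["PII_ADMIN"]))
    print(attach_policy_sql("customers", "full_name", "mask_name"))
```

The generated strings could then be run through a cursor from the Snowflake Python connector; the connection handling is left out here so the sketch stays runnable without an account. Passing a different `masked_expr` is one way to hide only certain portions of a value.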

Use-case 3: Protecting PII data while providing it to a third party through a Snowflake reader account

A reader account with permission to read the production table but no right to modify data is created
This applies when test data must be provided to a credit reporting agency such as TransUnion or Equifax
Python code is used with specific parameters to deploy masking policies on PII fields in the production table
Data protection is achieved for the reader account without exposing sensitive PII
No need for code changes to update masking policies
Metadata table used to update masking policy information and automate the process
The implemented solution can be easily scaled to cover additional tables and applications
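The point that masking policies can be updated without code changes can be pictured as a comparison between what the metadata table says and what is currently attached to the table. The Python sketch below is a hypothetical illustration of that idea: the function name, the action labels, and the column and policy names are all assumptions, and it only computes the list of changes rather than applying them.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, str]  # (action, column, policy)


def plan_policy_changes(applied: Dict[str, str], desired: Dict[str, str]) -> List[Action]:
    """Compare applied column policies with the metadata and list the changes.

    applied: column -> policy currently attached to the table
    desired: column -> policy listed in the metadata table
    """
    actions: List[Action] = []
    for column in sorted(set(applied) | set(desired)):
        current = applied.get(column)
        wanted = desired.get(column)
        if current == wanted:
            continue  # already in the desired state
        if current is None:
            actions.append(("set", column, wanted))
        elif wanted is None:
            actions.append(("unset", column, current))
        else:
            actions.append(("replace", column, wanted))
    return actions


if __name__ == "__main__":
    applied = {"full_name": "mask_name", "ssn": "mask_full"}
    # Editing a metadata row is enough to change the outcome; the code stays the same.
    desired = {"full_name": "mask_name", "ssn": "mask_last4", "email": "mask_email"}
    print(plan_policy_changes(applied, desired))
```

Because the desired state is read from data, adding a table or an application means adding metadata rows, which is what makes the approach easy to scale.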

Endnote

To sum up, Jade's automated data masking solution helps businesses maximize Snowflake data governance and stay compliant with various regulations and standards that mandate the protection of sensitive data. The solution offers several benefits, including improved security, compliance with regulations, cost-effectiveness, and realistic testing.

The architecture of Jade's data masking solution is based on Python and Snowflake and covers three use cases. The solution is scalable and works on individual tables as well as in batch mode.

To view a technical demo of the data masking solution built by Jade, watch the on-demand webinar "Maximize Data Security in Snowflake Test Environments", which also includes a detailed Q&A session that may answer your technical questions.

For more information or a personalized demo, connect with Jade today.


About the Author

Mitali Sharma

Associate Manager-Content Strategist at Jade Global

Mitali Sharma, an Associate Manager-Content Strategist at Jade Global, has been a technology writer for the last ten years. She writes about technologies like Cloud Computing, Robotic Process Automation, Enterprise Applications, digital transformation, etc.