Jade’s Test Data Management System using Snowflake Data Masking
Data runs the current business world. In all the business domain like Banking & Financial Services, Insurance, Healthcare, Manufacturing, Retail, Media etc. data plays a vital role in driving the business in most efficient manner. It helps in core operations, services, customer support, marketing, customer segmentation, decision making, forecasting, budgeting and planning. Data traverse through multiple connected and unconnected systems of any organization through real time as well as batch processes. Organizations share data among themselves for the business needs and growth as well. Starting from the root level developers all the way to the C-Level managements, everybody use data for their duty.
As data is the driving factor of most of the organizations it is also equally important to make it highly secured so that it should not be misused or misinterpreted by any user, process or application. In certain countries customer data protection is considered to be mandatory for the organizations. E.g. GDPR mandate in Europe countries.
Need for Data Masking:
Organizations maintain different environments/systems like Production, Development, Test, etc. for all the applications they use to run the business. Apart from Production environment rest all environments are non-Production environments/systems. They use these non-Production or Test systems for different types of Quality Assurance checks before implementing any new application or change to any existing application.
Customers’ personal data is highly sensitive. Most of the organizations want to restrict the Personal Identifiable Information (PII) of Production Data to any Test Systems/Applications or Test System Users. Examples of PII are Name, Account Number, Customer Number, Address, Phone Number, Tax ID, SSN, Email ID, Date of Birth, etc.
Objective is that no individual/application from Test System should be able to uniquely identify any customer real information. There comes the necessity of the data masking. If the PII elements of test systems are masked then nobody or no application can identify the customers’ real information. There are several tools in the market to do the data masking of test systems.
Snowflake provides in-built dynamic data masking policies by which organizations can mask the test data in runtime. Please refer to the following link for more details.
Need of Snowflake Test Data Management System:
In most of the organizations the consumers of the data can be applications like a centralized Data Warehouse system, Downstream Business Intelligence (BI) Reporting Tools, Marketing Campaign Department/Tools, Customer Relationship Management (CRM) Systems, Telephone Customer Service Department, Sales and Services Department/Tools, Data Science Pipelines used by different departments, Core Operation System like Mainframe, Ecommerce Online Application, Third Party Vendors, Governmental Compliance departments, Business Partners, etc. All these tools/departments need mask data for QA purpose. So, following are the main pain points and the need for the Jade’s Reusable Snowflake & Python based Test Data Management System.
- If any organization is already using Snowflake as their licensed Cloud Data Warehouse or Database for any application then they don’t need to purchase any additional tool for doing the data masking tasks for test systems. They can use Snowflake’s inbuilt dynamic data masking. Snowflake dynamic data masking masks the data on the fly while data is being read by test users/applications. But one of the main challenges in real world is that organizations have tons of tables which stores PII data elements. Sometimes organizations need specific customized masking rules to be applied to some of the PII elements. If organizations go for data masking, then they have to individually set up the masking set up for all the tables and PIIs and tag them properly. This will be a time-consuming process and will have to do the set up every time they add a new table having PII data element. In this case Jade can help organizations to use one reusable metadata driven solution through which they can get rid of lot of manual interventions in the data masking process.
- Now let’s now see another real-life use case now. Organizations have multiple databases (Mainframe File System, DB2, SQL Server, Oracle, Teradata, PostgreSQL, Snowflake, File Systems, etc.) for multiple source applications. Let’s say the Ecommerce application of one of the banking organizations undergoes system upgrade and they need test data for Quality Assurance purpose. Let’s say here the application is in PostgreSQL database. At the same time another Lending system upgrade is also happening. They also need test data for the lending system which is in SQL Server data base. And let’s say their centralized cloud data warehouse is in Snowflake. So, in this case to provide the masked test data to both the Ecommerce and Lending Systems organization needs to pick the source data from these applications and generate the masked data out of those. Then feed the masked data back to these 2 systems. Also the masking policy/rule applied to any PII element should be consistent throughout the enterprise. This needs proper co-ordination and communication among the departments. In this case organizations will need a masking tool to generate the masked test data for these 2 systems. As we can see the organization is already using the Snowflake Cloud Datawarehouse. So, they don’t need to go for another masking tool. In face they can leverage the Jade’s Snowflake & python based reusable solution to generate masked test data for both the Ecommerce and Lending Systems.
- There are some organizations which haven’t started using data masking governance policies yet and they are unsure about which all tables contain PII elements. If they need to start the process then this solution can also be helpful for identifying the PIIs of all the tables as well.
Architecture of Jade’s Solution for Test Data Management Systems:
There will be reusable modules created in Snowflake. We just need to provide the connection details and the Metadata table details. Then rest all should be auto driven. Following are design flow and the architecture diagram of the solution.
Step 1: Provide the source table/file connection/path details in the reusable config file.
Step 2: Enter the metadata information which will drive the data masking of the required table/file and fields.
Step 3: Bring in the Production data from the required source system into Snowflake production stage area.
Step 4: Create a inbound data share of the production stage table in snowflake test system.
Step 5: If the organization wants to find the PII lists then there is a module which identifies that and provides a list of PII fields. Still, it needs some manual verifications to finalize the PII elements based on domain knowledge.
Step 6: If PII identification is already done as prerequisites then python program will check the metadata table to find which all fields are PII elements and which all masking policies to be applied for those PIIs and apply those on the fly on PROD Stage table.
Step 7: Run the ETLs/Stored Procedures to read the masked data from the data share and load the curated and consumption layers.
Step 8: Run Python scripts to read the masked data from the data share and load to the source applications which need test data.
Sample of Metadata Table for driving the Masking operation:
Using the Jade’s provided “Snowflake Test Data Management System” solution any organization can have their test data masked for any application/system, any number of tables, any number of data elements within very limited time by just providing the connection details, database names, table names, Data Element names to be masked. This will ensure organizations don’t use any other licensing tool for data masking if they already have snowflake license. This will help organizations to generate and use the masked test data using only one system throughout the enterprise which will maintain the integrity.