Because of contracts, compliance standards, or regulations and legislation, many organizations today struggle with development teams not having access to production data.
The problem is notably worse when teams have to investigate issues reported by users in production. Is it possible to provide engineering teams with all the information they need, as if there were no restriction? Let’s explore some alternatives, but first, let’s understand what information we are trying to protect.
Personally Identifiable Information (PII)
According to the National Institute of Standards and Technology, PII is information that can be used to distinguish or trace an individual’s identity — such as name, social security number, biometric data records — either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual (e.g., date and place of birth, mother’s maiden name, etc.).
The University of Pittsburgh has an excellent list of examples of what could be considered PII: Guide to Identifying Personally Identifiable Information (PII).
PII is the sensitive data that we need to protect. Let’s see some examples of how we can avoid hindering our engineering team’s performance while keeping these data protected.
The first problem is, of course, with logs, which may contain PII. Here is a list of recommendations:
1- Use a search tool on the code to find all the logging calls and review them, prioritizing components and packages that process PII (registration and account management/update, for instance). Look for exception messages too.
People often forget about exception messages that show up on stack traces. As you review the calls to log entries, please also review the exception messages and clean them up.
2- Search your logs using regular expressions to find emails, phone numbers, zip codes, addresses, etc.
3- Adopting logging standards should make it a lot easier to maintain a healthy log. Here’s a suggestion:
Until you can clean your logs, I suggest storing them encrypted (sometimes a password-protected compressed archive will do the trick) and having someone with production access fetch only the part needed and clean the PII from it before sending it on.
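The regex search from recommendation 2 and the cleanup step described above can share the same patterns. Here is a minimal sketch in Python, assuming US-style phone and social security number formats; the pattern set, `find_pii`, and `redact` are illustrative names, and real logs will need patterns tuned to the data you actually collect.

```python
import re

# Illustrative patterns only (assumption): tune them to your own data formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(line):
    """Return (label, matched_text) pairs found in one log line."""
    return [
        (label, match.group(0))
        for label, pattern in PII_PATTERNS.items()
        for match in pattern.finditer(line)
    ]


def redact(line):
    """Replace every match with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        line = pattern.sub(f"[{label}]", line)
    return line


sample = "User jane.doe@example.com called from 555-123-4567 about claim 42"
hits = find_pii(sample)      # use this to audit existing logs
clean = redact(sample)       # use this before handing an excerpt to engineering
```

Running `find_pii` over a log file shows where PII leaks in, which points back to the logging calls that need fixing. `redact` is only a stopgap for excerpts that have to leave production in the meantime.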
If the system shows an error code to the user (to be reported to the support team when opening a request), don’t forget to include that error code in your log entries. It also helps to have an exception framework that auto-generates the error code (it could be a simple UUID) and appends it to the exception message. That makes finding the exact log entries much easier.
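As a rough illustration of that idea, the sketch below shows a base exception that generates a UUID and appends it to its own message. `AppError` and `handle_request` are hypothetical names, not part of any particular framework, and the logger setup is left to your application.

```python
import logging
import uuid

logger = logging.getLogger("app")


class AppError(Exception):
    """Hypothetical base exception that carries a generated error code."""

    def __init__(self, message):
        # The message itself must stay free of PII; only the code identifies the event.
        self.error_code = str(uuid.uuid4())
        super().__init__(f"{message} [error_code={self.error_code}]")


def handle_request(action):
    """Run the action; on failure, log the error and return its code for the user."""
    try:
        action()
        return None
    except AppError as exc:
        # The code is part of the exception message, so it lands in the log entry.
        logger.error("Request failed: %s", exc)
        return exc.error_code


def failing_action():
    raise AppError("Could not update account")


code_shown_to_user = handle_request(failing_action)
```

The user quotes the code in the support request, and whoever has production access searches the logs for that exact string instead of for a name or an email address.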
A data scrambler is a tool that needs to be custom-built for each database. The data scrambler replaces all PII with tokens so that the data is still consistent. It preserves the relationships between the data and replaces each piece of data with the same unique token everywhere.
For instance, if you have a health insurance system where there are claims from the insured and the dependent, the primary insured’s name will be replaced by the same string everywhere. Another bogus string will replace the name of the dependent everywhere. So you have a 1:1 relationship between each PII and its replacement. Doing so ensures that the data structure, relationships, and integrity are preserved.
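One possible way to get the same replacement everywhere without storing anything is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; this is my choice of technique for illustration, not the only option. `SECRET_KEY` and `token_for` are hypothetical names, and truncating the digest makes collisions theoretically possible, although very unlikely at this length.

```python
import hashlib
import hmac

# Assumption: the key is kept secret and outside the repository.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-repo"


def token_for(value, kind):
    """Return the same meaningless token every time for the same (kind, value)."""
    message = f"{kind}:{value}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{kind}_{digest[:12]}"


# Hypothetical rows from two tables of a health insurance system.
insured = {"id": 1, "name": "Maria Silva"}
claims = [
    {"claim_id": 10, "insured_id": 1, "patient_name": "Maria Silva"},
    {"claim_id": 11, "insured_id": 1, "patient_name": "Pedro Silva"},
]

scrambled_insured = {**insured, "name": token_for(insured["name"], "NAME")}
scrambled_claims = [
    {**row, "patient_name": token_for(row["patient_name"], "NAME")} for row in claims
]
```

The primary insured gets the same token in both tables, the dependent gets a different one, and the ids that link the rows are untouched.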
The data scrambler then exports the SQL code necessary to load this data on a database to be provided to the engineering team.
There are several ways to accomplish this. The easiest way is to write a script to do the scrambling, run it on a copy of the production database, and provide the engineering team with a dump of this data (or with direct access to the scrambled copy).
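To make the script approach concrete, here is a minimal sketch using SQLite from the Python standard library. The in-memory database stands in for a copy of production, and the table names, column names, and `scramble_copy` function are all assumptions for illustration. A real database will need its own tooling for the copy and the dump.

```python
import sqlite3


def scramble_copy(conn, pii_columns):
    """Scramble the listed (table, column) pairs in place and return the mapping."""
    mapping = {}

    def scramble(value):
        if value is None:
            return None
        if value not in mapping:
            mapping[value] = f"PII_{len(mapping) + 1:06d}"
        return mapping[value]

    # Expose the Python function to SQL so it can be used inside UPDATE statements.
    conn.create_function("scramble", 1, scramble)
    for table, column in pii_columns:
        # Identifiers come from trusted configuration, never from user input.
        conn.execute(f"UPDATE {table} SET {column} = scramble({column})")
    conn.commit()
    return mapping


conn = sqlite3.connect(":memory:")  # stands in for a copy of production
conn.executescript("""
    CREATE TABLE insured (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE claim (id INTEGER PRIMARY KEY, insured_id INTEGER, patient_name TEXT);
    INSERT INTO insured VALUES (1, 'Maria Silva');
    INSERT INTO claim VALUES (10, 1, 'Maria Silva'), (11, 1, 'Pedro Silva');
""")

mapping = scramble_copy(conn, [("insured", "name"), ("claim", "patient_name")])

# SQL text that recreates the scrambled database for the engineering team.
dump_sql = "\n".join(conn.iterdump())
```

The returned `mapping` never goes into the dump. Whether you keep it, and who can read it, is a separate decision.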
If you keep the mapping between the original data and the replacement, your scrambling tool might offer the ability to reverse the changes. It might be helpful for the support team in their communication with users.
You may also want to set up an ETL process that directly updates a target database with scrambled production information. This can speed up the process, especially if the replacement mapping is kept after the process runs. By referring to the mapping, the support team can quickly point engineers to the user or other data relevant to the issue, even though it is scrambled.
The most sophisticated solution is to take a vertical slice of the data: everything related to the user who reported the issue, plus any environment configuration or settings, in a form that can be loaded into an existing internal database. This is considerably harder to build and maintain.
Policies around PII collection
Your organization should have policies to guide PII collection. As part of these policies, you must ensure that any new PII is covered by the scrambler tool. In other words, adding PII to the data storage means that you also have to add it to the scrambler.
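One way to keep that policy from being forgotten is an automated check that compares the list of known PII columns against what the scrambler handles. The sketch below assumes both lists exist as simple sets of (table, column) pairs; `PII_COLUMNS`, `SCRAMBLER_RULES`, and `unscrambled_pii` are hypothetical names.

```python
# Hypothetical inventory: columns the organization has classified as PII.
PII_COLUMNS = {
    ("insured", "name"),
    ("insured", "email"),
    ("claim", "patient_name"),
}

# Hypothetical scrambler configuration: columns the scrambler currently replaces.
SCRAMBLER_RULES = {
    ("insured", "name"),
    ("claim", "patient_name"),
}


def unscrambled_pii(pii_columns, scrambler_rules):
    """Return the PII columns that the scrambler does not cover yet, sorted."""
    return sorted(set(pii_columns) - set(scrambler_rules))


missing = unscrambled_pii(PII_COLUMNS, SCRAMBLER_RULES)
if missing:
    print("PII columns missing from the scrambler:", missing)
```

Run as part of the build, a check like this fails as soon as someone registers a new PII column without adding a scrambler rule for it.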
Fewer tickets for development teams means less friction with data access
One more thing to consider is to empower your Customer Support team to handle more issues independently. Here’s a post about it: