A privacy-preserving approach in data streaming architecture

March 29, 2022
by Pawel Wasowicz

Demand for (near-)real-time business insights based on continuous data streams keeps growing. Stream processing and analytics have the potential to improve companies' competitiveness considerably. However, this trend is challenged by the need for data security and privacy. Legacy approaches to data security, used extensively in single-silo setups, are not suitable for modern data platforms, which leverage streaming architectures. New and strict regulations like GDPR and HIPAA pose additional challenges for data-centric companies. This blog post highlights the concepts relevant to building successful privacy-preserving data platforms.

Introduction

For successful privacy-preserving data management, an organization needs to combine data governance and data security. While data governance focuses on managing processes and metadata, data security focuses on restricting access to data.

Standard data security techniques provide auditability of the whole data platform, traceability of users' actions, and secure access to resources. This, however, is not enough, as recent data breaches show:

  • Trusted services can be compromised
  • Data leaks can occur
  • Privacy regulations like GDPR or HIPAA bring more attention to this area and are a main driving force behind technological advancements in the data privacy domain

Consequently, standard data security solutions, which aim to reduce the likelihood of data breaches by preventing unauthorized access and sealing data at rest, need to be complemented by another set of techniques that are fundamental in the data privacy context.

The goal: data privacy is in place to eliminate the effects of data breaches and sensitive data exposure by applying safety measures to the data itself.

Sensitive data and data protection – what challenges do companies face?

Sensitive data can be described as any data that requires a special trust model, following regulations enforced by, e.g., government or industry.

It may be:

  • Personally Identifiable Information (PII)
  • Protected Health Information (PHI)
  • or Non-Public Information (NPI)

Data protection, on the other hand, is understood as data security and data privacy combined, and it should be applied across entire data pipelines in a data platform. Protecting user data, and sensitive data in general, should be of the highest importance in any organization. However, the goal of securing data naturally competes with its utility: applying randomization techniques to all sensitive data renders it unusable for analytics, and keeping data siloed rules out both federated and streaming analytics approaches.

It is not easy to preserve data utility and comply with privacy regulations at the same time. Building data protection solutions on the notion of notice and consent alone is far from enough. Authorized parties must respect data privacy too; hence, what can be inferred from the data must be controlled.

The need for real-time insights, usually met through data streaming architectures, poses a further challenge for data privacy. Data is no longer only "at rest", persisted safely in dedicated storage. It continuously flows along data pipelines: it is "in transit", and because of streaming analytics, it may be continuously "in use".

These three aspects, "at rest", "in transit", and "in use", make it a real challenge to develop privacy-preserving data platforms, which may in turn hamper the adoption of modern data platform solutions.

Data Security, Data Privacy, and Data Encryption

Data security and data privacy complement each other. However, it is crucial to differentiate between them and to recognize the techniques each of them uses.

As mentioned, data security focuses on preventing unauthorized access to and usage of data. Originally, when data was mostly static and stored in central warehouses, with data-related jobs running within these silos, organizations relied on access control techniques. This may have been role-based access control (RBAC), which was already an upgrade over access control lists (ACL). However, RBAC had its shortcomings, and the need for fine-grained control paved the way for attribute-based access control (ABAC). ABAC is one of the most sophisticated and fine-grained security methods. It makes decisions based on attributes of the data itself, metadata, the system's context, and users' properties.
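
To make the difference between RBAC and ABAC more tangible, here is a minimal Python sketch. It is not taken from any specific product: the attribute names, roles, and the policy itself are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Attributes collected for one access decision (all names are hypothetical)."""
    user_role: str               # property of the user
    user_department: str         # property of the user
    data_classification: str     # metadata of the data, e.g. "public" or "pii"
    data_owner_department: str   # metadata of the data
    purpose: str                 # system context: why the data is being read


def rbac_allows(request: AccessRequest) -> bool:
    # Role-based check: the decision depends on the user's role only.
    return request.user_role in {"analyst", "data_steward"}


def abac_allows(request: AccessRequest) -> bool:
    # Attribute-based check: combine user, data, and context attributes.
    if request.data_classification == "public":
        return True
    if request.data_classification == "pii":
        return (
            request.user_role == "data_steward"
            and request.user_department == request.data_owner_department
            and request.purpose == "support_case"
        )
    return False  # deny by default for unknown classifications


request = AccessRequest(
    user_role="analyst",
    user_department="marketing",
    data_classification="pii",
    data_owner_department="customer_care",
    purpose="campaign_analytics",
)
print(rbac_allows(request))  # True: the role alone is sufficient
print(abac_allows(request))  # False: data and context attributes do not match
```

In this sketch, the same request passes the role-only check but is rejected once the classification of the data and the purpose of the access are taken into account.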

Another pillar of data security is data encryption. For data "in transit", TLS channel tunneling was usually considered sufficient, possibly complemented by integrity and non-repudiation mechanisms. For data "at rest", cryptographic encryption was applied at the dataset, file system, or whole-disk level. In legacy architectures that relied on static, centralized data silos, these concepts provided the desired level of security, as data rarely left the storage and processing happened within the silos too.

A data privacy toolset, on the other hand, tackles the problem of preserving high data utility while maintaining privacy in dynamic and decentralized data platforms. As we already know, the goal of data privacy is to comply with legal privacy regulations like GDPR. At the same time, its focus lies on keeping the utility of data high across the organization.

Three ways to successfully enable data privacy

There are basically three complementary ways to improve the privacy of data.

  1. Carefully split PII and other sensitive data from the rest and democratize access only to the non-sensitive parts: That way, both datasets can be managed differently, e.g., for GDPR's "right to be forgotten" rule, it is enough to delete the respective records only from the PII dataset.
  2. Apply anonymization and tokenization methods: With these methods, individual privacy gets improved while the ability to perform data analytics is retained.
  3. Apply encryption: In data streaming architectures, this translates to message- or even field-level cryptographic encryption of messages "at rest" and "in use".
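
The first two approaches can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, the in-memory PII store, and the hard-coded key are assumptions, and HMAC-SHA256 is just one possible way to derive tokens. Note that a keyed token like this is a pseudonym rather than a fully anonymous value, since whoever holds the key can recompute it.

```python
import hashlib
import hmac

# Hypothetical field names and key; a real key would come from a secrets manager.
PII_FIELDS = {"name", "email"}
TOKEN_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = TOKEN_KEY) -> str:
    """Deterministic keyed token (HMAC-SHA256): equal inputs give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def split_record(record: dict, id_field: str = "customer_id") -> tuple:
    """Split a record into a PII part and a non-sensitive part linked by a token."""
    token = tokenize(str(record[id_field]))
    pii_part = {"token": token, id_field: record[id_field]}
    analytics_part = {"token": token}
    for field, value in record.items():
        if field == id_field:
            continue
        if field in PII_FIELDS:
            pii_part[field] = value
        else:
            analytics_part[field] = value
    return pii_part, analytics_part


def forget(pii_store: dict, token: str) -> None:
    """Erase a person by deleting only the entry in the PII dataset."""
    pii_store.pop(token, None)


event = {
    "customer_id": 42,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "country": "CH",
    "basket_total": 129.90,
}
pii, analytics = split_record(event)
pii_store = {pii["token"]: pii}

print(sorted(analytics))  # ['basket_total', 'country', 'token']
forget(pii_store, pii["token"])
print(pii_store)  # {}
```

Because the token is deterministic, analytics jobs can still group or join events that belong to the same customer without ever seeing the name or email address.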

All three complement each other and should be used as building blocks of a privacy-preserving data platform. However, as no solution will ever be 100% bulletproof, the first and foremost approach should be to store only the minimum amount of data that is needed.

What does that mean for a company?

Privacy regulations affect all kinds of data and must be respected in every industry. Regulations such as GDPR, HIPAA, and PCI DSS, to name a few, can severely constrain the way organizations utilize their data and, as a result, negatively impact competitiveness and new product development. Careful examination of the available privacy-preserving techniques is crucial for organizations to release the potential hidden in their data.

To summarize, data security aims to ensure confidentiality of sensitive data and secure access to resources while providing auditability of activities in the system and traceability of users' actions. However, the trend to move to federated data streaming architectures on the one hand and new privacy regulations on the other require additional techniques to keep the utility of the data high. Such techniques are associated with so-called data privacy concepts, which we will explain in a follow-up blog post, so stay tuned!

The author
Pawel Wasowicz
Head of Data Engineering at mimacom
Software and data engineer with mimacom for 6 years now. Enjoys building solutions leveraging data research and analytics as well as rules-based programming. Currently leads the effort to establish a successful data engineering division.