You have surely noticed that there are many different types of data in your company. These differ, for example, in their source, format, quality, quantity, maturity, and potential for your business. This diversity of data and the various forms in which it is available means that applying a single strategy for its meaningful use, for a shared infrastructure, or even for a particular technology is not feasible in most cases.
But that’s a good thing! It is even desirable. Surprised? Then read on!
The starting point: Legacy systems and all-in-one data platforms are unable to meet the requirements of modern data management
Current challenges make data management in many organizations very chaotic. Data is collected for technological and functional purposes, as well as to ensure compliance with laws, and must therefore meet a wide range of requirements. In many cases, this has resulted in a large number of separate data platforms. Most of these platforms focus on specific tasks, such as data streaming, data analysis, or ML. Others take the opposite approach, claiming to be an all-in-one solution that meets (almost) all requirements for the use and processing of data within a single platform. They often rely on aggressive marketing to convince customers that all their expectations and complex requirements can be met with this one data management tool. Naturally, that sounds tempting! But many users quickly discover that their challenges are greater than originally thought, and the promising tools prove to be a disappointment.
As a result, many organizations have to rely on multiple legacy systems to cover all tasks and areas of data management, and often do not have the means to migrate them or cannot justify the costs involved.
The result: a revolutionary development of data platforms and data architectures
The radical transformation of the data sector also contributes to this problem; it is briefly outlined here using four milestones:
Data lakes: To address the challenges companies face in data management, a trend emerged toward extending classic data warehouse architectures – traditionally fed by OLTP systems (online transaction processing, in which many transactions are executed in parallel) – with NoSQL solutions. In the form of data lakes, these took over the world of data management within a very short time. Advantages: With data lakes, the format does not matter. Data can be stored, retrieved, and processed in its current form.
Data streaming: Transaction-oriented processing quickly reached its limits, which is why it soon gave way to what are known as data streaming architectures. Advantages: Going beyond the parallel processing made possible by OLTP, data streaming architectures allow data to be processed in real time.
Data lakehouses: However, real-time processing is not the end of the development of approaches to efficient data management. With a data lakehouse – a data management architecture that combines the strengths of data warehouses and data lakes – companies aim to do away with centralized data silos while still consolidating the most important data governance and security tasks through data fabric architectures. Advantages: Combines the benefits of data warehouses and data lakes, such as reduced data redundancy, simplified data observability, and lower storage costs.
Data mesh: The concept of data mesh has had a major impact on the field of data management over the last four years. Data mesh focuses less on technologies and more on the integration of proven organizational structures such as domains, flexible teams, and DataOps. Advantages: This increases a company's scalability in terms of its data management. Organizations can handle a rapidly growing amount of data with ease and successfully serve both new use cases and a growing number of data users.
The challenge: data protection policies for secure data management and artificial intelligence as a new influencing factor
But all these developments in the field of data management are reaching their limits, given that new data protection regulations have emerged all over the world in recent years.
Some of these regulations, such as the EU's General Data Protection Regulation, are very demanding but generic.
Others go further and strictly control the exchange of data (such as the EU-US DPF currently under debate).
Still others focus on specific aspects, like the European Data Governance Act, Data Act and Artificial Intelligence Act.
All of this influences data management in every organization and governs innovative approaches to protecting privacy and security. When it comes to artificial intelligence in particular, the huge advances in the field are likely to bring further changes to data management soon. Current innovations in AI are expected to have a significant impact on how companies organize their data landscape – that is, manage their data – in the future. These innovations currently include transformer-based deep learning architectures, the GPT-4 model (familiar from the well-known ChatGPT), promising new approaches in the field of artificial general intelligence, and automated machine learning, to name just a few.
The result: limitations in data management could soon be a thing of the past
The field of data management is changing dynamically, the demand is high and the number of influencing factors is immense. The fact that there is currently no clear, all-round winner among solutions for efficient data management makes the situation even more complex. Nevertheless, I would like to share a few tips that I believe have proven to be effective and will continue to be effective in the future.
The goal is clear: Companies would prefer to ask analytical questions directly, in natural language, with the assistance of an AI – like J.A.R.V.I.S. in Iron Man – and have them answered immediately based on the available data. Sounds simple and very practical, doesn't it?
An approach like this would have unlimited potential when it comes to the spectrum of questions. It would also enable data to be made available to a much wider audience. Companies would therefore finally achieve the desired data democratization.
Is that truly so unrealistic? Well, although the world's latest toy, ChatGPT, is not yet viable for this application, it is clear that the current limitations of large language models will soon be overcome. More variants of this and similar architectures are likely to arrive sooner than expected.
But before investing in a specific data mesh technology that is multi-cloud and polyglot-capable, it is important to first gain an understanding of viable solutions for your company's individual requirements.
The solution: Successful approaches to the complex requirements of data management
I will briefly present four of these approaches below, each with examples, tips and solutions.
1. Data discoverability
Long gone are the days when companies first had to collect data in one place, whether in a data warehouse or a data lake, to make it available.
The focus today is on investing in the discoverability of data. Simply put, this means that organizations need to focus on making the vast amounts of data they manage easily searchable to enable the relevant roles within the organization to find the data they need.
Example: A product team needs data on user experience and product usage, while a marketing team needs access to target groups and customer requirements. A business development specialist, on the other hand, must keep an eye on company data with the company's goals and objectives in mind.
Tip: Discoverability should therefore help all potential users find useful information in the company, follow the rules for correct data use, and learn about the structure – especially the schema – of the data. For some time, the goal was to address this task with a manual cataloging process (data cataloging) and a highly specialized, technology-based data lineage solution. However, this approach proved too complex and unsustainable in most cases.
Solution: Today, modern data discoverability solutions are based on automated processes, often drawn from the field of ML. These have changed data management in that data is now treated as a product: domain teams publish curated data sets, described, for example, with the help of attached metadata.
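To make the "data as a product" idea concrete, here is a minimal, purely illustrative sketch of such a catalog: each curated data set is registered as a data product with attached metadata (owning domain, schema, tags) and can then be found via keyword search. All names and fields are hypothetical, not the API of any particular catalog tool.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    domain: str          # owning domain team, e.g. "product" or "marketing"
    description: str
    schema: dict         # column name -> type, part of the attached metadata
    tags: list = field(default_factory=list)

class DataCatalog:
    """A toy in-memory catalog; real systems would persist and index this."""

    def __init__(self):
        self._products = []

    def register(self, product: DataProduct) -> None:
        self._products.append(product)

    def search(self, keyword: str) -> list:
        """Case-insensitive search over name, description, and tags."""
        kw = keyword.lower()
        return [
            p for p in self._products
            if kw in p.name.lower()
            or kw in p.description.lower()
            or any(kw in t.lower() for t in p.tags)
        ]

catalog = DataCatalog()
catalog.register(DataProduct(
    name="product_usage_events",
    domain="product",
    description="Curated stream of user interactions per feature",
    schema={"user_id": "string", "feature": "string", "ts": "timestamp"},
    tags=["user experience", "usage"],
))

hits = catalog.search("usage")
print([p.name for p in hits])  # ['product_usage_events']
```

In practice, the registration step is automated (for example by ML-driven metadata extraction), but the principle is the same: the domain publishes the product once, and every consumer can discover it through one search interface.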
2. Data virtualization and data gateways
Data virtualization refers to a logical data layer that integrates company data from different systems and makes it available uniformly and centrally in real-time. This enables efficient data management despite a fragmented data landscape.
Example: Business users can create a standardized data report that includes data from a variety of sources, retrieved through the data virtualization layer and made available in an integrated view.
Tip: To ensure that these products, i.e. the data, can be found and searched across several domains, you should establish a transparent data architecture in your company. This includes jointly developing and adhering to a strategy, including regulations and specifications for data management. Concepts such as data virtualization and data gateways are also part of this.
Solution: A virtual layer like this can support various data retrieval technologies such as SQL, REST, and GraphQL and, thanks to further abstractions such as data gateways, can also cover schema, security, and scalability requirements.
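The pattern behind such a virtual layer can be sketched in a few lines: consumers use one uniform query interface, while per-source adapters hide whether the data actually comes from a SQL database, a REST service, or something else. The backends below are stubbed in memory and all names are illustrative assumptions.

```python
class VirtualLayer:
    """Toy logical data layer: one query API over heterogeneous sources."""

    def __init__(self):
        self._adapters = {}

    def register(self, name, fetch):
        # fetch: a callable returning rows as list[dict] from one backend
        self._adapters[name] = fetch

    def query(self, name, predicate=lambda row: True):
        # The consumer never sees which system the rows come from.
        return [row for row in self._adapters[name]() if predicate(row)]

# Stubbed backends standing in for, say, a warehouse table and a REST API.
def crm_customers():
    return [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]

def web_orders():
    return [{"order": 17, "customer_id": 1, "total": 120.0}]

layer = VirtualLayer()
layer.register("customers", crm_customers)
layer.register("orders", web_orders)

eu_customers = layer.query("customers", lambda r: r["region"] == "EU")
print(eu_customers)  # [{'id': 1, 'region': 'EU'}]
```

A production-grade layer adds exactly what the text mentions on top of this core idea: SQL, REST, and GraphQL front ends, plus gateways handling schema mapping, security, and scaling.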
3. Data marketplace and data observability
One of the counterarguments in the discussion around modern data architectures relates to the additional effort that arises from maintaining the data or from the generalization of the interfaces. Unfortunately, additional costs cannot be avoided if a company wants to steer clear of the uncontrollable fragmentation of the data landscape while making use of the potential in its company data.
Example: Such costs are incurred in a wide range of areas. Storage, ingress, and egress costs in increasingly popular multi-cloud architectures should not be underestimated – but are worthwhile, considering the losses caused by unused data.
Tip: To get the most out of your data, you can allow access to data using an internal data marketplace. This is another step towards data democratization and increases the usability of your data.
Solution: A central platform where data can be offered by producers and found by consumers can also help spread the costs associated with data sharing across the company. An internal data marketplace can also be linked to certain aspects of data monitoring. This allows you to manage data strategies more effectively and to calculate and charge costs more precisely, for example based on the measured popularity of individual domains.
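The popularity-based chargeback idea can be illustrated with a tiny calculation: meter each read of a data product on the marketplace, then split a shared platform cost across domains in proportion to usage. The access log and the cost figure below are invented for the example.

```python
from collections import Counter

# Each entry is one read of a data product, as a marketplace might meter it
# (format assumed here: "<domain>/<product>").
access_log = [
    "sales/pipeline", "sales/pipeline", "hr/headcount", "sales/pipeline",
]

monthly_platform_cost = 1000.0  # shared cost to charge back (assumed figure)

counts = Counter(path.split("/")[0] for path in access_log)  # reads per domain
total = sum(counts.values())

chargeback = {
    domain: round(monthly_platform_cost * n / total, 2)
    for domain, n in counts.items()
}
print(chargeback)  # {'sales': 750.0, 'hr': 250.0}
```

The same metering data doubles as an observability signal: rarely accessed products are candidates for retirement, heavily accessed ones for extra investment.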
4. Data-driven organization of a company with DataOps teams
All in all, there’s no way around adapting your organization if you want to manage your data efficiently and make the most of it. Highly specialized teams that are limited to one data management technology are more likely to contribute to further fragmentation of the data landscape. Practices such as DataOps should be introduced instead.
Example: Just as many legacy systems can only cover one area, leaving data isolated in many systems, specialized technology teams tend to perpetuate this situation if they focus on just one application.
Tip: DataOps, on the other hand, is about shaping an overarching data community. This encourages the exchange of ideas and joint R&D activities to enable consolidation and unification in certain critical aspects of data architecture such as security, discoverability, observability and measurability.
Solution: DataOps teams are responsible for the handling of data required in the company. They discuss integration topics and manage tools that simplify the connection to the data platform centrally, and are therefore definitely a successful approach to data management. These teams are also responsible for tasks such as establishing best practices, defining naming conventions, generalizing metadata and maintaining self-service portals or GitOps pipelines.
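One of the smaller DataOps duties mentioned above, enforcing naming conventions, lends itself to automation in a CI pipeline. The sketch below assumes a made-up convention of the form `<domain>_<entity>_v<version>` and simply reports data set names that violate it; both the convention and the names are hypothetical.

```python
import re

# Assumed convention for published data sets: <domain>_<entity>_v<version>,
# all lowercase with underscores, e.g. "marketing_campaigns_v1".
NAME_PATTERN = re.compile(r"^[a-z]+_[a-z_]+_v\d+$")

def violations(dataset_names):
    """Return the names that do not follow the agreed convention."""
    return [n for n in dataset_names if not NAME_PATTERN.match(n)]

names = ["marketing_campaigns_v1", "Sales-Pipeline", "hr_headcount_v2"]
print(violations(names))  # ['Sales-Pipeline']
```

Run as part of a GitOps pipeline, a check like this turns a best practice from a wiki page into a gate that every published data product must pass.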
There are many reasons why the data landscape in an organization may be fragmented. Legal regulations, separate specialist domains, and different needs and objectives – to name just a few examples – contribute to the fact that the data landscape will be even more fragmented in the future.
Certain practices can be introduced to make this division controllable for organizations and to avoid isolated data silos. The aim of this blog post was to highlight that before data can be used, it must first be made comprehensible and accessible. This is rarely achieved by introducing a new technology alone, but it can succeed with the four approaches described.
Based on my experience with many customer projects, I can say that our experts at Mimacom can successfully solve a wide range of data platform and data management challenges for you. Feel free to get in touch. We will be happy to advise you on your individual data architecture.
Located in Bern, Switzerland, Pawel is our Head of Data Engineering. At Mimacom, he helps our customers get the most out of their data by leveraging the latest trends, proven technologies, and years of experience in the field.