Metadata management comes with various challenges for companies. Over the years, multiple approaches to resolve these challenges have been created. Their focus: collecting and cataloguing data, describing it, governing access, and supporting data discoverability. However, most of these approaches could not serve as generic solutions: they turned out to be narrowly scoped, vendor- or technology-specific, difficult to integrate and extend, or they required tremendous manual maintenance effort. In this first of three blog posts, we will have a closer look at what metadata is. We will examine what it is used for and see how metadata management has been handled so far. Lastly, we discuss why we need a paradigm shift to find a new, sustainable metadata management approach.
What is metadata?
Simply put, metadata is data about data. It describes all relevant aspects of an organization's data platform, binds systems and domains together, and is crucial for delivering insights, controllability, and efficiency. Metadata makes it possible to classify and label data so that it can be accessed and shared. It captures and grades quality aspects of data (accuracy, completeness, currency), helps enforce security policies, makes it possible to understand how data flows, gives information about its origin, and provides the means to monitor data utilization. How is this possible? Metadata includes comprehensive information:
- technology-related metadata: it describes data stores, infrastructure, and links data to technical platforms
- semantic metadata: it shows relations and dependencies, among other things, and links data items to each other
- analytics metadata: it refers to reports, models, analytic tools, and links data to analysis.
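The three facets above can be illustrated with a minimal, hypothetical metadata record. All field names here are illustrative assumptions, not taken from any specific catalog tool:

```python
# A minimal, illustrative metadata record combining the three facets.
# All field and dataset names are hypothetical.
customer_orders_metadata = {
    "dataset": "customer_orders",
    "technology": {            # technology-related metadata
        "store": "PostgreSQL",
        "location": "db-prod-eu-1/sales/customer_orders",
        "format": "table",
    },
    "semantic": {              # semantic metadata
        "description": "Orders placed by customers in the web shop",
        "upstream": ["customers", "products"],   # dependencies / lineage
        "classification": "PII",
    },
    "analytics": {             # analytics metadata
        "used_by": ["monthly_revenue_report", "churn_model_v2"],
        "last_profiled": "2023-01-15",
        "completeness": 0.98,  # a graded quality aspect
    },
}

# Such a record lets tooling answer questions like
# "which reports and models depend on this table?"
impacted = customer_orders_metadata["analytics"]["used_by"]
print(impacted)
```

Even this toy record shows how one dataset is simultaneously linked to a technical platform, to other data, and to downstream analyses.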
The importance of metadata management for business
In a company, metadata serves different goals:
- to describe data
- to provide reports, taxonomy, and ontology
- to catalog data in order to make it discoverable
This makes metadata important not only for tech but also for the business departments: metadata is supposed to bridge technical data description with meaningful business information, make the data understandable, form a backbone for data democratization, and be a fundamental building block for data self-service architectures.
Managing this metadata is thus a key activity for using data efficiently and making educated business decisions.
Evolution of metadata management
The problems metadata management tries to solve weren't that prominent in the past. Highly structured data, preferred schema-on-write approaches (meaning the data structure is defined before writing to the database), centralized architectures, homogeneous data structures, stable processes, and favoured batch processing rendered metadata easy to handle and, at the same time, highly static.
This has changed radically over the last decade. The approaches to metadata management that were developed based on earlier experiences already struggled to keep pace with the introduction of data lakes. The attitude changed completely, and the assumption of homogeneity was abandoned altogether. Non-homogeneous data, a multitude of data sources, a variety of structures and data types, and schema-on-read (an agile approach where the schema is defined while the data is consumed) paved the way for more modern metadata management solutions.
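The contrast between schema-on-write and schema-on-read can be sketched in a few lines of Python. This is a simplified illustration under assumed names, not tied to any particular database:

```python
import json

# Schema-on-write: the structure is enforced BEFORE the data is stored.
ORDER_SCHEMA = {"order_id": int, "amount": float}

def write_with_schema(record: dict) -> dict:
    """Reject records that do not match the predefined schema."""
    for field, ftype in ORDER_SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"field {field!r} must be {ftype.__name__}")
    return record  # stored as-is; the structure is guaranteed

# Schema-on-read: raw data is stored first; structure is imposed on consumption.
raw_store = ['{"order_id": 1, "amount": 9.99, "coupon": "XMAS"}']

def read_with_schema(raw: str) -> dict:
    """Interpret the raw record only at the moment it is consumed."""
    data = json.loads(raw)
    return {"order_id": int(data["order_id"]), "amount": float(data["amount"])}

print(write_with_schema({"order_id": 1, "amount": 9.99}))
print(read_with_schema(raw_store[0]))
```

Note how the raw store happily keeps fields (such as `coupon`) that no schema ever anticipated; that flexibility is exactly what makes metadata about such data harder to keep accurate.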
A plethora of specific solutions was created. Some specialized in cataloguing, others in discoverability or data flow, yet others were more full-fledged solutions but with relatively narrow (vendor) applicability.
Implications of new metadata management approaches
What do all these approaches have in common? They failed to fulfil their promise, mainly because they relied too heavily on manual maintenance. Someone had to maintain the accuracy of data catalogues, check compliance, steer usage control, and oversee many other activities. The shift to distributed data architectures such as data mesh deepened the problem further. Adopting Domain-Driven Design (DDD) in data platform architectures, with bounded contexts and distributed domains, made it obvious how difficult it is to achieve enterprise-wide, shareable, high-quality data on the one hand, and how essential that data is on the other.
Nowadays, we observe another shift. Unlimited, never-ending data streams conquer business processes in manufacturing, finance, insurance, healthcare, and other industries. Stream processing and stream analytics shorten the time between information demand and supply. They offer a way to boost performance and replace legacy periodic batch operations with (near-)real-time ones.
Examples of today's data dynamicity:
- distributed and evolving domains
- highly non-homogeneous data platforms
- a variety of data and data sources
- the demand for real-time information
- a multitude of different technologies
Altogether, they make so-called passive metadata management unusable in modern architectures.
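A tiny, hypothetical sketch shows why a passive catalogue breaks down: its entry is a static snapshot written down once, while the data it describes keeps evolving.

```python
# Hypothetical example: a passive catalogue entry is a static snapshot,
# maintained manually and updated only when someone remembers to.
catalogue_entry = {
    "dataset": "sensor_readings",
    "fields": ["sensor_id", "temperature"],
}

# Meanwhile the stream evolves: a new field appears in live events.
live_event = {"sensor_id": "A-17", "temperature": 21.4, "humidity": 0.55}

# The catalogue silently drifts out of date; no process notices the gap,
# so the new field stays invisible to anyone relying on the catalogue.
undocumented = set(live_event) - set(catalogue_entry["fields"])
print(sorted(undocumented))
```

With streaming data, such drift happens continuously, which is why manual curation cannot keep up.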
We need to find a new way of managing metadata
As explained in this blog post, metadata is data about data. As data has changed a lot in the past years, becoming much more dynamic, comprehensive, and complex, metadata has been brought to a new level too. This makes it necessary to also change the way we manage it. In order to continue describing data in a meaningful way, cataloguing it, and visualizing it to make it understandable for business, a new, more advanced approach is needed. An approach that meets the requirements of today's data velocity, volume, and structure.
In the next post, we will therefore look at different approaches that have been used to tackle this new challenge, and we will explain what active metadata management is about.