About Zach Showalter
Zach Showalter is a Senior Consultant on the Data team.
As organizations grow increasingly data-driven, the ability to quickly discover, understand, and trust internal data becomes more than a convenience—it’s a necessity. Over the past year, I’ve spent more time exploring data catalog solutions and the pivotal role they play in solving a challenge I frequently hear from clients:
“We know we have the data, but we can’t find it, don’t know who owns it, or aren’t sure if we can trust it.”
Even in smaller organizations, data is often scattered across platforms, locked in silos, or poorly documented—resulting in a frustrating roadblock to true data self-service. This growing complexity has made it clear that without better tooling, data governance and discovery become reactive, inconsistent, and slow. A well-curated Data Catalog has gained traction throughout the industry as a solution to this problem, and today I want to talk about how the recent rise of open-source options presents a low-risk, high-value opportunity for organizations to explore the benefits of these tools.
If you’re unfamiliar with Data Catalogs as a concept, a Data Catalog is a tool that functions as a live-service repository of an organization’s data assets, metadata, and data-related intelligence. The Catalog integrates with databases and other data sources to ingest the metadata from those sources (and just the metadata, not the actual data itself), and it provides an interactive front end with search-engine-like functionality that allows users to easily discover organizational data and the context surrounding it.
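To make the "metadata only" idea concrete, here is a minimal sketch in Python. It is not how OpenMetadata or any particular catalog is implemented; it only illustrates the concept using an in-memory SQLite database as a stand-in data source. The function names (`harvest_metadata`, `search`, `build_demo_source`) and the `owner` and `description` fields are hypothetical, chosen purely for illustration.

```python
import sqlite3


def build_demo_source() -> sqlite3.Connection:
    """Create a throwaway in-memory database to act as a data source (assumption: demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    # A real row exists in the source, but the harvester below never reads it.
    conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
    return conn


def harvest_metadata(conn: sqlite3.Connection) -> list[dict]:
    """Collect table and column metadata without reading any row data."""
    catalog = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
        catalog.append(
            {
                "table": table,
                "columns": [{"name": c[1], "type": c[2]} for c in columns],
                "owner": None,       # hypothetical field, curated later by people
                "description": "",   # hypothetical field, curated later by people
            }
        )
    return catalog


def search(catalog: list[dict], term: str) -> list[str]:
    """Return names of tables whose name, description, or column names contain the term."""
    term = term.lower()
    hits = []
    for entry in catalog:
        haystack = [entry["table"], entry["description"]]
        haystack += [col["name"] for col in entry["columns"]]
        if any(term in text.lower() for text in haystack):
            hits.append(entry["table"])
    return hits


if __name__ == "__main__":
    catalog = harvest_metadata(build_demo_source())
    print(search(catalog, "customer_id"))
```

The harvester only touches schema information, so the email address stored in the source never reaches the catalog. A real platform layers ownership, descriptions, lineage, and access control on top of this same basic pattern.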
Data Catalog platforms offer functionality for users to:
When well-curated, a Data Catalog:
It is especially valuable for data engineers, analysts, and business users who need reliable and well-documented data for decision-making.
Problem solved, right? Unfortunately, the reality of implementing a Data Catalog is not always so straightforward. With massive amounts of metadata to ingest, curate, and define, it can be difficult to know where to begin without formulating an implementation strategy for data ingestion, onboarding users, and overall adoption of the tool.
Furthermore, while there is no shortage of Data Catalog tools available, they are generally offered as managed SaaS platforms that command a hefty price tag and lock you in to that vendor’s service. Even with the benefits a Catalog offers, it can be hard to justify the cost of another managed service alongside the more critical services needed for day-to-day operations. With all these factors to weigh, choosing and implementing a Data Catalog platform becomes a far more daunting decision, one that often gets pushed further and further down the line as something that’s “nice to have.”
This is where the new wave of open-source Data Catalog platforms comes in. While tools like Apache Atlas have been around since 2018, recent years have brought more feature-rich options to the table.
The key benefit?
Organizations can host and control their metadata catalogs in-house, avoiding costly SaaS subscriptions. The only expenses are infrastructure-related.
Even more important than cost is the agility open-source tools provide. Because there’s no financial buy-in required, organizations can:
A great example of this approach is the platform I chose for my own testing: OpenMetadata.
When I began exploring Data Catalog tools, I was drawn to OpenMetadata due to:
Launched in 2023 by the team behind Uber’s metadata platform, OpenMetadata features:
Architecture diagram
Architecture documentation
For testing, OpenMetadata offers a quickstart Docker Compose setup. With some basic Docker knowledge and their setup guide, I was able to deploy a local instance and start ingesting metadata within minutes.
Standing up a hosted proof-of-concept version takes a bit more work depending on your stack and permissions, but it’s very doable. You can typically:
…all in under a week.
This agility allows you to test the tool with your real metadata, which is far more valuable than viewing a generic demo. And if it’s not the right fit? No problem. You can decommission easily with minimal loss. If it is useful, you can keep building it out—or pivot to a managed version later.
If you’re interested in looking into Data Catalogs for yourself, this repository maintained by OpenDataDiscovery offers a high-level feature breakdown of the major open-source and proprietary platforms as well as links to their websites. It’s an excellent resource for making your own comparisons:
https://github.com/opendatadiscovery/awesome-data-catalogs
Additionally, if this article has made you curious enough to give one a try, OpenMetadata maintains a live sandbox where you can try out the web app for yourself. For details on how to access it, follow the link here:
https://docs.open-metadata.org/latest/quick-start/sandbox