You would be hard-pressed to find anyone who was not at least somewhat impressed the first time they tried ChatGPT. After its public release, the conversation in the tech space changed seemingly overnight to how AI would change everything. But much like past hot topics in the tech world – such as the blockchain, no-code solutions, or even Clippy, the Microsoft Office Assistant – it is supremely important to separate hype from practical application. Things will almost certainly change significantly as the technology matures, but for now, in its infancy, we have to evaluate the technology as it exists today. So why are we talking about ChatGPT and your data strategy?
If you are building a data strategy for your organization and thinking about incorporating ChatGPT, you need to weigh where implementing it could help you and where it could hinder you.
In prior blog posts, we at UDig defined a data strategy with four main elements – Story, Oversight, Transformation, and Culture. My blunt belief is that ChatGPT alone is not capable of accomplishing any of those four – each requires deep institutional and organizational knowledge to succeed.
A data strategy is an overarching plan that guides your organization in leveraging data and analytics to achieve your critical business objectives. It’s built upon those four key concepts: Story, Oversight, Transformation, and Culture.
ChatGPT’s purpose as a Large Language Model (LLM) simply is not to understand your organization’s data infrastructure or risk management culture. These are core components of data strategy where experienced stakeholders are irreplaceable for the foreseeable future (until Clippy finally ascends to a new level of intelligence). The big question is whether ChatGPT, as a hot new technology, can help you compose your strategy at all. As a careful daily user in the workplace myself – I say “absolutely…with caveats.” Before jumping straight into the details of that “absolutely,” it’s important to understand the caveats, and how trying to implement ChatGPT at the strategic level without attention to the fine details could quickly create perils for your organization.
One crucial concept to understand about ChatGPT is that, as an LLM, it was trained on an immense amount of language data so that it can parse the prompts and questions it is given and formulate seemingly logical responses. (For a longer and more detailed explanation, see ArsTechnica’s write-up.)
I would like to focus on the key word “seemingly.” Many have been impressed by ChatGPT’s ability to give eloquent answers to complex questions. A major distinction to make here is the difference between eloquence and validity. OpenAI, the organization responsible for the creation of ChatGPT, admits on their website that responses can be incorrect – using the term “hallucinate.”
I recently worked on a client project helping integrate the REST API of a third-party vendor product intended to better facilitate internal data governance practices. The API was undergoing rapid changes, and so, too, was the documentation. One key piece of the integration involved using a relatively new endpoint of the API.
After a good deal of time spent unsuccessfully using Google to find a definition of a particular parameter in the documentation, I turned to ChatGPT to see if it could provide any answers. It DID supply me with an answer. That answer, however, was not a valid input for my API requests. Understanding that this could have been a “hallucination,” I clicked regenerate a few times. It apologized for the incorrect response and then provided me with new documentation examples! More than once, in fact. Unfortunately, none of those inputs were valid either. This led me to email the vendor’s support team, which supplied me with the correct inputs. Unsurprisingly, they did not align with anything ChatGPT had given.
Imagine now if, instead, the above scenario were focused on your organization at a strategic level rather than the individual level I was operating at. A data strategy affects an organization’s data management practices from top to bottom. Therefore, it is necessary to ask: how would the effects of a false or incorrect response ripple throughout the organization if implemented as fact?
Another important distinction to make about ChatGPT, particularly if your organization and/or your data strategy is still maturing, is that the data that trained the underlying language model has a cutoff date. It is unaware of anything, whether it be a world event or a new data engineering framework, past September of 2021.
As somebody who has been following developments in the data engineering and analytics space for many years now, I can confidently say that the pace at which new frameworks, libraries, etc., have been developing and competing for the hearts and minds of data professionals has exploded. Whether your data strategy is about building up entirely new infrastructure for a young organization or modernizing your stack and migrating off legacy tools that no longer fully serve your data needs – it is important to have the most up-to-date knowledge when making critical implementation decisions. Your organization’s ideal solution should not be held to the same September 2021 cutoff that ChatGPT is.
As a real example – a tool called dbt has seen massive adoption in just a few short years due to the way it helps organizations manage their data transformations. Its primary cloud-based offering is marketed to organizations as dbt Cloud.
dbt reached its 1.0 release just a few months after ChatGPT’s cutoff date. Yet when asked about a product that, after only a few years, already touts massive enterprise clients such as Nasdaq, McDonald’s, JetBlue, and Condé Nast on its website, ChatGPT’s response was very telling – it had no awareness of the product’s rise.
While there are other, more ground-level risks associated with implementing ChatGPT (such as one Samsung learned the hard way – usage without strict guardrails can cause big data leaks and bigger headaches), our focus here is at the strategic level. Circling back to the opening question of whether ChatGPT can help you form a data strategy – the answer I provided was “absolutely… with caveats.” Now that we have reviewed some caveats, let us review the benefits.
ChatGPT is a language model – use it for language!
As mentioned prior, ChatGPT is a language model that excels at producing logical text. This means any boilerplate text can be generated with a prompt, a click, and light editing – a massive efficiency gain over fully manual creation. Imagine if, when working to improve your organization’s data strategy around Transformation and Culture, a defined goal was to give all organizational ETL processes clear and accessible documentation, as in the sketch below.
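To make that concrete, here is a minimal sketch of how such a documentation prompt could be scripted with the OpenAI Python client. It assumes an API key in the environment; the model name, prompt wording, and example process description are all illustrative, not prescriptive.

```python
# pip install openai
# A minimal sketch, assuming the OpenAI Python client (v1+) and an
# OPENAI_API_KEY environment variable. Model choice and prompt wording
# are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Write clear, accessible documentation for the following ETL process. "
    "Cover its purpose, inputs, outputs, and schedule.\n\n{description}"
)

def draft_etl_docs(description: str) -> str:
    """Generate first-draft documentation for an ETL process; a human edits the result."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(description=description)}],
    )
    return response.choices[0].message.content

# Hypothetical process description, purely for illustration.
print(draft_etl_docs("Nightly job that loads policy records from S3 into the warehouse."))
```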
You can see that a simple prompt is enough for ChatGPT to hit the ground running and provide a framework for this scenario in a data strategy. From there, I would save time by quickly tailoring what was provided rather than starting from scratch.
Earlier, I recalled an anecdote of unsuccessfully using ChatGPT to fill in the gaps around some API documentation. While the invalid result that day was a teachable moment, the idea of using it for such purposes was not new. Many times, I have used ChatGPT in similar circumstances with success. When it produces a valid result, both the time saved from not having to hunt for a solution via Google and the additional knowledge gained from follow-up prompts are invaluable. The key is to ensure that critical decisions stemming from prompts like this are made by subject matter experts (SMEs). When ChatGPT provides an answer that can best be described as confidently incorrect, the best counterbalance is an expert who knows enough to evaluate that answer properly.
We have already established that ChatGPT is sometimes incorrect and will not be able to learn or understand your specific organizational needs long-term. However, that does not mean you cannot ask it for general advice regarding your data strategy. ChatGPT does not know your organization, but you do! As such, you can incorporate bits and pieces of context (such as organizational size, as in the sketch below) into prompts to help it discern your needs when formulating a response.
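As a simple illustration – the organizational details here are hypothetical – that kind of context can be folded into a reusable prompt template:

```python
# A hypothetical prompt template that folds basic organizational context into
# the question so the model can better target its response.
def strategy_prompt(industry: str, headcount: int, data_team_size: int, question: str) -> str:
    """Compose a prompt that gives the model basic organizational context."""
    context = (
        f"Context: we are a {headcount}-person {industry} organization "
        f"with a data team of {data_team_size}."
    )
    return f"{context}\n\n{question}"

print(strategy_prompt(
    industry="insurance",
    headcount=200,
    data_team_size=3,
    question="What should our data strategy prioritize over the next six months?",
))
```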
A major issue for organizations with burgeoning data strategies is data governance. Some organizations are in industries – insurance, banking, defense, etc. – laden with hyper-contextual terms, and those terms are often reduced to even more arcane acronyms and abbreviations by practitioners who speak them like a fluent second language. In addition, and independent of industry, years or even decades of loose data governance can leave specific data fields whose inputs are only loosely mapped.
Imagine, as an example, an insurance data field where a value that should map directly to a single distinct term such as “worker’s compensation” instead appears as “wc,” “workers comp,” “w0rk3rs_C0mp3ns@t10n,” etc. Although that last example is particularly zany, loose governance could mean a near-infinite pool of (considerably more plausible) possibilities. Now also imagine that I’m handed just the term “wc,” and I’m an outsider to insurance industry terminology who needs to dive in and start governing that data. A few targeted prompts are all it takes to create a base upon which I can start building.
First, I would ask what the acronym stands for and what that term represents.
Now that I have a base-level understanding, I would ask about all the different ways the term could be abbreviated or turned into an acronym, giving myself a starting point for mapping all of them to a single value.
Finally, I’d ask ChatGPT to put the results into a format that integrates into data engineering code quickly, such as JSON. I could use this file to compare against what’s actually present in organizational data, narrow it down, and – while it is unlikely to be fully inclusive of what needs fixing – start conforming some of the problem data to the data governance ideal of a single field value per definition.
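As a sketch of that last step – assuming a mapping like the one ChatGPT might draft, with an illustrative and deliberately incomplete variant list – the JSON and a small conforming function might look like this:

```python
import json

# Illustrative JSON mapping, of the kind ChatGPT might draft: messy variant
# spellings -> a single canonical value. The variants here are hypothetical;
# a real list would be narrowed against the organization's actual field data.
VARIANT_MAP_JSON = """
{
    "wc": "workers_compensation",
    "workers comp": "workers_compensation",
    "worker's comp": "workers_compensation",
    "workers compensation": "workers_compensation",
    "w0rk3rs_C0mp3ns@t10n": "workers_compensation"
}
"""

# Lowercase the keys so lookups are case-insensitive.
variant_map = {k.lower(): v for k, v in json.loads(VARIANT_MAP_JSON).items()}

def conform(value: str) -> str:
    """Map a messy field value to its governed canonical form, or flag it for review."""
    return variant_map.get(value.strip().lower(), f"UNMAPPED:{value}")

print(conform("Workers Comp"))  # -> workers_compensation
print(conform("wc"))            # -> workers_compensation
print(conform("liability"))    # -> UNMAPPED:liability (needs human review)
```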
While this post only covered portions of the informational vastness that is ChatGPT, the intent here is to convey that careful usage CAN be a positive contributor when defining your data strategy. We live in the earliest days of the AI era, and development has already shown rapid iteration and improvement.
A final and important thought to consider when developing your data strategy in this new era is that, besides considering and applying the knowledge laid out above, your data strategy will need to be nimble.
The caveats that we defined may become non-issues tomorrow, new features may be incorporated that could directly fit your organization’s business case, etc. The future will be ever-evolving, and so your data strategy should be ever-evolving.
…just like Clippy.
If you are ready to build your data strategy, our team can help! We align your data strategy with your business strategy to ensure you make the most of your data and utilize it to its fullest potential. Walk away with a clearly defined, informed and actionable plan with our data strategy accelerators.
Josh Narang is a Consultant on the Data team.