How Data Provenance Drives Machine Learning Risk and Value
Data provenance, the origin and history of data, is crucial for machine learning and AI risk management and value creation. It affects data quality, legal compliance, and ethical use. Board members should ensure their organizations have robust data provenance practices to mitigate risks, enhance AI model reliability, and protect company value. Understanding data provenance is key to responsible AI governance and strategic technology oversight.
The Provenance of Provenance
For many, provenance is a foreign term, frequently (and ironically) confused with the Provence region of France. But if you’ve ever heard the story of a rediscovered work of art, a classic car with a famous owner, or a counterfeit bottle of expensive wine, you already understand the importance of provenance.
Provenance is just knowing where something came from. The word traces back to the Latin provenire, “to come forth,” and you might recognize its root, venire (“to come”), as the source of the “veni” in Julius Caesar’s famous “veni, vidi, vici.” So, while provenance is the technical term, you can substitute “origin” or “lineage” in most conversations.
In art, we can simplify provenance as answering two questions:
- Who created the work?
- Does the current possessor have the right to transfer?
When it comes to collectibles, the chain of ownership may create more value than the work itself! A vintage car might be worth $1M on its own, but much more (or less!) if Frank Sinatra or James Dean once owned it. Conversely, works without provenance or with poorly supported documentation – often stolen – are typically sold far below true market value.
As we have seen, provenance typically boils down to documenting a timeline of ownership or transfer. The importance of clearly establishing these facts and tracking them over time can be seen in many fields: art, archival processing, and, as we’ll be discussing today, technology.
Data Provenance: A Board-Level Concern
What does provenance have to do with data? As board members overseeing technology strategies and risk management, understanding data provenance is crucial for ensuring the reliability, legality, and ethical use of AI and machine learning models in your organization. Good (or bad!) data provenance can have a significant impact on machine learning models, operating efficiencies, and financial outcomes.
Let’s reframe the two questions about art from above in the context of data:
- What individuals or organizations are described in the data?
- Does the current possessor have the right to transfer or use?
Answering these questions is critical for at least three reasons:
- Know Thy Data
- Contracts
- Regulations
Know Thy Data
First and foremost, how much value can you create from data if you don’t understand or trust it? If you aren’t clear about the entities, actors, or actions in a data model or data sample, then how are you going to make inferences or take action? And if you can’t make inferences or take action based on your data, what’s the point in collecting it? Clearly, you can create more value if you spend the time to “know thy data.”
When you collect data directly from the “source” of data – e.g., when you ask someone to rate a product they have purchased from you – you can directly record the source and examine data quality issues. But when you acquire data “second-hand” or “third-hand” from someone else, trust becomes increasingly important. In these second-hand and third-hand situations, “know thy data” really means “know thy data provider” too.
Contracts
Second, you may have purely contractual obligations or rights that need to be considered. For example, if a contract explicitly prohibits an organization from re-using or re-distributing data, then any “downstream” use or work products will create breach of contract risks.
There are two common examples of relevant contract terms:
Purpose of Use
This term limits the purpose of data use to the scope of the customer contract; in other words, while a service provider can retain and re-use a customer’s data to provide a product or service to that same customer, the service provider has no other legitimate use.
Anonymizing the customer’s data or aggregating it with other customers’ data, even if well-intentioned, still likely violates the purpose clause.
Confidentiality + Non-Disclosure
The second common contractual term is the confidentiality or non-disclosure obligation. Such terms explicitly prohibit a party from redistributing information without the consent of the “owning” party. In many cases, organizations acquire such restricted information from a third party without knowledge of the “upstream” issue.
While you might think that the damages or inconvenience should be limited to the party who violated the contract alone, the truth is that the information’s “owner” and many courts may not agree.
Regulations
The third reason for ensuring data provenance is the most publicized and fundamental: regulation. Laws and rules are meant to be complied with, and almost all commercial contracts, financing documents, and purchase/sale agreements include fundamental representations and warranties related to compliance with applicable laws and rules.
Federal and state laws (and EU-wide regulations, for my continental readers) may require organizations to maintain specific documentation for the data that they gather. Generally, these requirements are limited to data that is personally identifiable; the specifics of what data is considered identifiable vary by regulation and may take additional factors into account.
Neglecting to appropriately consider data protection regulations can have a direct and severe impact on an organization’s machine learning models, operations, and overall financial wellbeing.
While I won’t dive into the specifics of data processing regulations in this post, it’s sufficient to understand that regulations often limit a company’s use of consumer data to the purposes for which it has legal grounds. So how do you ensure that the data you’re using is allowed under law?
Legality of Data Use
Whether a given use of data is lawful will usually be determined by a company’s legal department, outside counsel, or an internal governance role. If consent is the basis for processing data, there should be a means by which each piece of data can be traced back to the consent that covers it. This might take the form of attached metadata or, if the data is not anonymized, a consent record linked directly to the data. Note that some of these methods may create additional instances of identifiable data, which would need to be appropriately addressed.
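As a rough illustration of tracing data back to consent, the sketch below attaches a consent record to each data record and filters by purpose and date. The class names, field names, and purpose labels are hypothetical, chosen only for this example; a real system would follow whatever the applicable regulation and your counsel require.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical consent metadata attached to a data record."""
    consent_id: str
    purposes: frozenset          # purposes the data subject agreed to
    granted_on: date
    withdrawn_on: date | None = None

    def permits(self, purpose: str, on: date) -> bool:
        """True if consent covers `purpose` and was active on the date `on`."""
        if on < self.granted_on:
            return False
        if self.withdrawn_on is not None and on >= self.withdrawn_on:
            return False
        return purpose in self.purposes


@dataclass
class DataRecord:
    """A piece of data plus the consent it was collected under."""
    record_id: str
    payload: dict
    consent: ConsentRecord


def usable_for(records: list, purpose: str, on: date) -> list:
    """Keep only the records whose consent covers this purpose on this date."""
    return [r for r in records if r.consent.permits(purpose, on)]
```

Calling `usable_for(records, "model_training", date.today())` would return only the records whose consent explicitly lists that purpose and has not been withdrawn. Note that the consent record is itself potentially identifiable data.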
It’s worth noting that although I’m not discussing it in this particular post, copyright considerations play a significant role in data usage rights. Accordingly, it’s crucial for board members to ensure that their organizations have robust processes for obtaining appropriate intellectual property rights for any data used in AI and machine learning models.
Third-Party Data
If data has been obtained from a third party, rather than from the data subjects themselves, the issue of provenance is even more important. Organizations should have an established process by which they document the lineage of third-party data, as well as any limitations on that data.
Data lineage or data provenance has become increasingly important, as technology relies more extensively on previously collected or generated data, which is itself becoming more voluminous. Without information about where data originated, how it was obtained, and what has been done to it, users of said data expose themselves and their organizations to the risk of negative financial, legal, and reputational outcomes.
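To make “where data originated, how it was obtained, and what has been done to it” concrete, here is a minimal sketch of a lineage record for a third-party dataset. Every field name and value is an assumption made for illustration, not a standard schema; the content hash is simply one way to tie the record to a specific version of the data.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """Hypothetical lineage entry for a dataset acquired from a third party."""
    dataset_name: str
    provider: str                  # who supplied the data to us
    original_source: str           # where the provider says it came from
    acquired_on: str               # ISO date string, e.g. "2022-01-15"
    permitted_uses: set            # uses allowed by contract or law
    restrictions: list = field(default_factory=list)
    transformations: list = field(default_factory=list)
    content_sha256: str = ""       # fingerprint of the current content

    def record_transformation(self, description: str, new_content: bytes) -> None:
        """Log what was done to the data and re-fingerprint the result."""
        self.transformations.append(description)
        self.content_sha256 = hashlib.sha256(new_content).hexdigest()

    def allows(self, use: str) -> bool:
        """True only if this use was explicitly permitted."""
        return use in self.permitted_uses
```

The design choice worth noting is that `allows` is deny-by-default: a use that nobody documented as permitted is treated as not permitted.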
What the Future Looks Like
We’re headed down an interesting path: technology is both enabling the exponential growth of data – which complicates the process of establishing provenance and lineage – and offering potential solutions to managing the very problems it is creating.
The best example of this phenomenon is the proliferation of MLOps platforms like MLflow. Fundamentally, “machine learning operations” platforms provide databases and APIs for the management of datasets and models. The original goal behind such MLOps platforms was to increase efficiency and quality for data engineers and data scientists, but over time, their value from a compliance perspective has become clear.
MLOps systems allow organizations to create and version datasets, including their provenance and lineage. These datasets can then be used to train machine learning models, which are themselves stored, versioned, and even run by these MLOps platforms. If this sounds like tracking provenance, that’s because it is!
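The core idea can be sketched without any particular platform. The toy registry below is not MLflow's API; it is a hypothetical, in-memory stand-in that versions a dataset by hashing its content and records which dataset version each model was trained on.

```python
import hashlib
import json


def dataset_version(rows):
    """Derive a short version id from dataset content.

    Row order affects the id; key order within a row does not.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class Registry:
    """Hypothetical in-memory registry linking models to dataset versions."""

    def __init__(self):
        self.datasets = {}   # version id -> metadata about the dataset
        self.models = {}     # model name -> version id it was trained on

    def register_dataset(self, rows, source):
        version = dataset_version(rows)
        # Keep the first metadata recorded for a given version.
        self.datasets.setdefault(version, {"source": source, "n_rows": len(rows)})
        return version

    def register_model(self, name, version):
        if version not in self.datasets:
            raise KeyError(f"unknown dataset version: {version}")
        self.models[name] = version

    def provenance_of(self, name):
        """Answer: which data, from which source, trained this model?"""
        version = self.models[name]
        return {"model": name, "dataset_version": version, **self.datasets[version]}
```

Because the version id is derived from the content, any change to the training data produces a new version, and a model can always be traced back to exactly the data it saw.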
Blockchain and Data Provenance
It’s 2022, which means I can’t finish this post without a paragraph on blockchain or DLT. To be fair, however, legal scholars arguably first described blockchain-like systems while solving these exact provenance problems. For example, Nick Szabo first published Secure Property Titles with Owner Authority in 1998, in which he wrote:
“The property is represented by titles: names referring to the property, and the public key corresponding to a private key held by its current owner, signed by the previous owner, along with a chain of previous such titles. Title names may ‘completely’ describe the property, for example allocations in a namespace. (Of course, names always refer to something, the semantics, so such a description is not really complete). Or the title names might simply be labels referring to the property. Various descriptions and rules – maps, deeds, and so on – may be included.”
Two decades later, ideas like this have matured to the point where multiple technical solutions capable of implementing these ideas are available. Whether the future of provenance will live on web3 in a public or private chain remains to be seen, as the cost of storage and adoption may outweigh benefits for at least the foreseeable future. However, given the rate of innovation and unpredictable history of technology adoption, it’s wise not to rule any possibility out completely. The future has a habit of surprising us all.
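For the curious, the “chain of previous such titles” in the quote above can be approximated in a few lines. This sketch is a simplification under a stated assumption: it links titles with SHA-256 hashes rather than the public-key signatures Szabo describes, so it demonstrates tamper evidence but not proof of who authorized a transfer. All names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Title:
    """One link in a chain of ownership for a single asset."""
    asset: str
    owner: str
    prev_hash: str   # digest of the previous title; "" for the first title

    def digest(self) -> str:
        payload = json.dumps([self.asset, self.owner, self.prev_hash])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def transfer(chain, new_owner):
    """Return a new chain with a title for `new_owner` linked to the last one."""
    last = chain[-1]
    return chain + [Title(last.asset, new_owner, last.digest())]


def verify(chain) -> bool:
    """Check that every title points at the unmodified title before it."""
    if not chain or chain[0].prev_hash != "":
        return False
    for prev, cur in zip(chain, chain[1:]):
        if cur.asset != prev.asset or cur.prev_hash != prev.digest():
            return False
    return True
```

If anyone rewrites an earlier owner in the chain, the digests no longer line up and `verify` returns `False`, which is the same property that makes a documented chain of ownership valuable for art or cars.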
Board-Level Considerations
Boards should:
- Understand and oversee data provenance practices as an important element of the AI risk management process.
- Mitigate legal, regulatory, and reputational risks associated with data obtained by questionable means.
- Protect the company’s bottom line by safeguarding against the use of tainted or improperly sourced data.
- Ensure reliability and explainability of AI models for regulatory compliance and stakeholder trust.
- Focus on data provenance for internal benchmarking and development.
Conclusion: Provenance as a Cornerstone of Responsible AI Governance
As we’ve seen, data provenance is not just a technical concern but a fundamental aspect of responsible AI governance and risk management. For board members, understanding and overseeing data provenance practices is crucial for ensuring the reliability, legality, and ethical use of AI and machine learning models in your organization.
By prioritizing robust data provenance practices, boards can help their organizations navigate the complex landscape of AI and data ethics, mitigate risks, and create sustainable value through responsible innovation. As the AI landscape continues to evolve, those organizations with strong data provenance foundations will be better positioned to adapt, comply with emerging regulations, and maintain stakeholder trust.
Remember, in the world of AI and data, knowing where your data comes from is just as important as knowing where you’re going with it. As board members, it’s our responsibility to ensure that our organizations are not just data-driven, but provenance-aware.