Is Big Tech wrong to train AI models on 'messy' public data?

Examining Big Tech and ‘Messy’ Data Sets


The rapid adoption of artificial intelligence across many parts of a business is complicating contractual agreements with customers and third parties, as well as mergers and acquisitions (M&A) transactions. Data ownership and licensing risks may not currently be addressed in existing contract reviews, legal reviews, and other due diligence activities, including:

  • Training Data: AI systems rely heavily on training data, which is often scraped or open-source and whose ownership and usage rights are unclear. Understanding and documenting training data flows is the first step.

  • Data Quality: AI data pollution occurs when the data used to train or operate AI models is flawed, incomplete, or biased, potentially leading to biased predictions, unreliable recommendations, and inaccurate insights.

  • Clarity of Ownership: Determining ownership of training data can be complex and uncertain. It might be subject to claims by third parties, infringement claims, privacy issues or other legal restrictions. This uncertainty could impact not only the use of training data, but also ownership of algorithms built using that data and any synthetic data created.

  • Use Limitations: If training data has use limitations, it can restrict how a company commercializes and licenses the data, develops technology, and applies algorithms.

Synthetic Data

The reliance on public data for training AI models exposes companies to significant copyright and privacy risks, while synthetic data, though a promising alternative, may face limitations in scope and accuracy when derived from insufficient original data. The push for AI models to handle large-scale data creates operational challenges, including data freshness, regulatory pressures, and the need for real-time insights, all of which necessitate robust and secure technology infrastructures to manage and mitigate these risks effectively.

Ali Golshan is CEO and cofounder of Gretel, a platform that lets companies experiment and build with synthetic data. Golshan says synthetic data is a safer and more private alternative to "messy" public data, and that it can shepherd most companies into the next era of generative AI development.
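As a rough illustration of the idea (a toy sketch, not Gretel's actual product or API), synthetic data generation fits a statistical model to sensitive real data and then samples new records from that model, so the original records never need to be shared:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" column (e.g. customer ages) that cannot be shared directly.
real = rng.normal(loc=40, scale=10, size=1000)

# Fit simple summary statistics to the real data...
mu, sigma = real.mean(), real.std()

# ...and sample synthetic values from the fitted distribution instead of
# releasing the originals. No individual real record is reproduced.
synthetic = rng.normal(loc=mu, scale=sigma, size=1000)

# The synthetic column preserves aggregate statistics of the original,
# which is what makes it useful for model training and experimentation.
print(abs(synthetic.mean() - real.mean()) < 2.0)
```

This also shows the limitation noted above: if the original sample is too small or unrepresentative, the fitted model, and therefore the synthetic data, inherits that gap in scope and accuracy.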
