GenAI's Overlooked Threat: Hand Over the Data or Your Life
GenAI personalization will be a data heist
GenAI’s voracious data appetite is well-known, but the scale of GenAI’s data heist has largely been ignored.
Lost in the GenAI hype is that it is built on data taken at two levels, both equally opaque in origin.
First, massive training datasets whose contents are largely unknown and which appear to ignore copyrights and other usage restrictions. A good example is DALL-E being trained on images from paid stock-photo databases.
Second, to personalize anything, GenAI needs data about you, collected from private data sets.
Again, we don’t know where this personal data comes from or what is included. Has your bank purchased extra data about you from data brokers? Who knows?
And who is behind this data heist? Why big tech of course!
👉TAKEAWAYS
🔹Use of data is not without challenges
Although quite a lot of training data is available, the latest AI models consume hundreds of billions of training tokens and therefore require extremely large data sets, which are not readily available.
🔹Datasets are being built to train AI models
Considerable effort has gone into building datasets, including for scientific purposes. Before data can be used it must be cleaned, and substantial amounts are lost in the process. There is also a risk that the training data was itself generated by AI.
🔹Content creators are dependent on copyright protection
The availability of training data is limited by copyright protection. Many models have been trained on data that may be subject to rights protecting the creators of those texts.
🔹Proprietary data becomes ever more valuable
These training requirements make data a valuable input. Companies also collect data when their applications are used, in order to refine their models. This is worth keeping in mind, because the input you provide can contain confidential (or private) information.
Google DeepMind (the "Chinchilla" scaling study): “we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled”
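The quoted rule can be made concrete with a small sketch. The specific numbers below are assumptions, not from this article: the often-cited compute estimate of roughly 6 FLOPs per parameter per token, and the Chinchilla-style ratio of about 20 training tokens per model parameter.

```python
# Sketch of the compute-optimal scaling rule quoted above.
# Assumptions (not from the article): training compute C ~ 6 * N * D FLOPs,
# and a compute-optimal ratio of ~20 tokens per parameter.

def optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal number of training tokens D for a model with N parameters."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate: C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

# Doubling model size doubles the optimal token count, quadrupling compute.
for n in (1e9, 2e9, 4e9):
    d = optimal_tokens(n)
    print(f"{n/1e9:.0f}B params -> {d/1e9:.0f}B tokens, ~{training_flops(n, d):.1e} FLOPs")
```

A 1B-parameter model under these assumptions already needs on the order of 20B tokens, which illustrates why data sets of this size are hard to assemble without scraping.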
👊STRAIGHT TALK👊
GenAI models are getting away with the largest data heist in history, while largely ignoring attempts to enforce data privacy or copyright protection.
How would I know? Well, it's easy; I asked ChatGPT about my first book, Innovation Lab Excellence, and it laid out sections from the book.
Thankfully, it also showed that the source was my own website, where I put two chapters of the book behind a sign-up wall some years back. Still, the copyright is valid, right?
Note to self: never leave part of your book on your website!
But the real problem is that this data goes largely to big tech. With Microsoft, Meta (Facebook), Amazon, and Google as the dominant GenAI players, their control over your data will be uncontested.
GenAI represents the largest data heist ever perpetrated, and in the face of big tech, most of us stay silent.
Thoughts?