It seems to me that a more effective strategy would be for an entity like Reddit or Stackoverflow to initiate a lawsuit, as they may be able to better prove that their owned/user’s content was copied, terms of service were violated, etc.
Since Altman is on their board, it's more that companies other than OpenAI would be affected? OpenAI got what it needed; it has its own straw in the milkshake, so making access more expensive only hurts everyone else.
Purely conjecture obviously, but with him on both sides of that, it's not hard to see how all this becomes more favorable to his company, widening the moat others would have to cross.
Sam is on record, and has even given lectures, stating the key to business success is monopolization.
It seems rather obvious under his perspective to do everything possible to restrict access to all information that could be used to train an LLM.
I understand Sam’s history with Y Combinator gives him leniency around here, but we seriously need to wake up to the fact that he’s attempting to monopolize the most powerful tool ever created. He talks a good game, but if you pay attention to his actions it’s clear he is openly hostile towards the common good he supposedly promotes.
I see him in the same regard as I see any oligarch vying for public favor by ‘stealing’ billions, donating a few million, and riding off into the sunset being praised by media for how much of the ill gotten gains he’s given back.
That's what Reddit executives say, but I don't think they're being truthful. Big AI players could easily be deterred by a mere terms-of-service change, or a modest rate limit.
Instead they imposed an extremely abrupt new demand for money with any API call.
Reddit, for one, is not about to do any such thing. They're clearly trying to cash out on the grounds that they are a giant source for AI companies, and they're fighting their own users who try to remove their content or gate it.
The interesting thing is that this is not legal everywhere in the world. Fair use is a concept prominent in the US and not established in Europe (the data-mining directive is a bit more limiting). Also, the concept of informational self-determination does not really exist in the US, despite the fact that the UN statements on human rights in the digital age clearly point in this direction. This is all 'wild west' to me as a European, and I wonder whether there are any 'guns' that work to protect yourself by means of self-justice.
How does this work for data brokers? For example, if a data broker legally collects the data in the US, then sells it elsewhere, where it could not be legally scraped?
I think that generally, if a customer hires you for something that is legal in their country, like price comparison, they're fine if you employ means that wouldn't be legal for them, such as automated scraping, as long as it's legal for you.
More importantly, the laws are representations of human feeling and social negotiation. Copyright law grew out of a complex array of beliefs and interests. E.g.: https://en.wikipedia.org/wiki/Statute_of_Anne
It could be that the various "AI" generative models are legal under current law. But it could also be that this is one of those things, like the printing press, that causes people to say, "Hey, that's not right," and change the law to rule out something that was previously legal.
"Wild West" is a common way to refer to American-style business decisions in Germany.
I vaguely remember a Stern magazine cover from around 2004, depicting a giant cowboy boot with "GM" decorated with red, white and blue coming down on a crowd of people standing in the shape of the Opel logo and titled something like "Der Wild-West-Method."
It was clearly not a compliment.
(Correction: GM had owned Opel for decades at that point, but was getting more aggressive with how it was run)
In Norway, while we use "vill vest" (wild west) as well, we also use "Texas" as a synonym for "crazy". It's not necessarily negative ("that party was completely Texas" could imply it was crazy in a fun way depending on context) but it often is implying chaos and lawlessness.
Scribd data was not public-facing, and they don’t even own the copyright to most of the books they host. Maybe OpenAI created a scraper that got around the paywall?
If that’s all it takes to skirt copyright restrictions, Aaron Swartz must be rolling in his grave right about now.
Any data that is available on the internet is not necessarily public. It might be mistakenly openly available, just as someone might leave their car door unlocked.
Also, hoovering up all available data on the net does not make it any less private.
I have always wondered, is there legal precedent here? If someone were to ssh into your server without permission that's pretty obviously illegal access. But the web is just another protocol on another port, what makes it different?
I certainly never said Google could crawl my website, and I would be surprised if the presence of a robots.txt file is written into law as a requirement to prove you didn't want them on your web properties.
I feel like this must have been hashed out in lawsuits already, probably against search crawlers.
A famous French white-hat hacker was convicted in France for publishing confidential information found through a Google search. The key point in the case was that he was undeniably aware that this was just a file-server misconfiguration and that he wasn't supposed to be able to access it.
If you want more info about the case, look up the "affaire Bluetouff".
And I'm pretty sure you'd be wrong. For example, uploading a movie you don't hold the copyright to to YouTube is illegal. That movie might be public data, but you don't have the permission of the copyright holder to copy it.
Publicly accessible =/= You having the right to do whatever you want with something.
The internet is closer to leaving your own journal open at the library or a magazine store, and people going through it. You made it available for everyone to peek at in a public space. Maybe you didn't intend to leave it there, maybe you didn't intend to make it visible, but you did.
Not all data on the internet is permissively licensed; much of it is copyrighted. ChatGPT being able to give opinions derived from a 1992 NYT article on the fall of the Soviet Union ... is debatably subject to copyright. Practically, though, the ship has sailed on ChatGPT. We're not going to legislate away AI.
Facts about the world like the image that I have on my website, or other facts about the world like my blogposts? Or facts about the world like my source code?
Scraping publicly available personal information and keeping a database of it is already against the GDPR and the like, so it's not entirely unreasonable. (Of course an LLM isn't exactly a database, but that would likely be the main topic of the trial.)
> The data accessed included "private information and private conversations, medical data, information about children — essentially every piece of data exchanged on the internet it could take — without notice to the owners or users of such data, much less with anyone's permission," per the lawsuit.
Why aren't they suing the hosts that allowed the data to be stolen? Where's the notice of security breach?
You take a derisive tone, but there is a violation of expectations here for a lot of people.
All of the stuff I've put on the web, I put there for actual people to use. Crawlers and scrapers came around, and while I certainly didn't like it or approve of it, it was something I put up with. The only defense against it was to stop putting things on the web entirely, which seemed like an overreaction.
Now, however, the use of that data to train AI (something that I consider actively harmful to society in general and don't want to support in any way), is a degree too far in the pot that's been slowly coming to a boil.
While I do want to give back and contribute to the larger body of work available to people, I want it to be available to actual people. I don't want it to be available for training AI.
I don't see why that's such a ridiculous stance. However, it's not legally possible to both make a work available to the public and prevent that work from being used for other purposes. That's a real shame, and if the only protection available to me is to no longer publish the works at all, that's a loss for everybody.
Is... any of that coded in law somewhere or illegal?
"I don't want it to be available for training AI" is a perfectly reasonable personal preference. We can discuss whether it's selfish or not, whether others agree or not, etc.
But this is a lawsuit. What makes it illegal for OpenAI to use your content? Like, is there some license you've put up on your content that disallows it? Is there anything that's relevant to the case at hand?
Medical data and data about children have very specific protections. The article says these were accessed without permission. If a lab tech dropped my lab results on the floor and you picked them up, it would not give you the right to store and use my information.
How were they accessed without permission if they were protected? Is that the fault of the AI folks, or of those sworn to protect the information? Unfortunate and bad, but if it's up there in the open, I'm more worried about the fact that it's available than the fact that an AI scraped it. If an AI can find it, so can any number of bad actors.
The data can have been leaked illegally and put on the internet. That doesn't mean you can use it.
As an EU citizen, some random company can't use PII related to me if I don't give consent or revoke my consent. They'll have to remove it or face a huge fine. The law doesn't care about the cost for your company to comply.
> The Children’s Online Privacy Protection Act (COPPA) gives parents control over what information websites can collect from their kids. The COPPA Rule puts additional protections in place and streamlines other procedures that companies covered by the rule need to follow. The COPPA FAQs can help keep your company COPPA compliant. Learn about the COPPA Safe Harbor Program and about organizations the FTC has approved to implement safe harbor programs. You can also get information about ways to get verifiable parental consent– including new methods the Commission has approved – and the process for seeking approval for new methods.
There's one. There are similar laws in Europe.
The AI scraped, utilized, kept, keeps... data on children. Parents did not consent. Children /can't/ consent.
The AI use of such data is not only illegal but could end up with people being jailed over it.
I assume the parents consented to have that information on a website, or it couldn’t have been scraped in the first place. They may be suing the wrong party here.
What makes it legal? If you find code on GitHub without a license, then this does not mean "public domain". It means "the author has given permission for you to view this, but nothing more". Arguably using it for training counts as "more".
So my question would be "what license entitles OpenAi to use my content"? It could be that as part of the user agreement of the website you gave some rights to the website and they sold it on to OpenAi, could be fair use, could be "does not count as 'more'",... .
Bad analogy: For money laundering we have similar rules. If you suddenly appear with a large amount of money/data, then the obligation is on you to show that it is clean, not on others to show that it is dirty. You can disagree with that (innocent until proven guilty etc.), but it is not clear cut which way around things should be.
I never said it was illegal. I was responding to the implication that because you put something publicly on the web, it's somehow ridiculous to object to certain uses of it.
As to the legality, that remains unknown until a court makes a ruling. I suspect that this lawsuit will go nowhere, but I'm just speculating along with everyone else.
Of course as the data’s owner you can object to anything. Who is to say that you are not allowed to hate certain practice?
On the other hand, you can’t prevent people from using your data if you put it out in public without condition (in the form of some acceptable licenses).
As an analogy, if you put a picture of your living room out on the sidewalk, you can’t prevent me from looking at it or studying it when I walk by. I may even benefit from it by copying your style of decoration. You may, however, cover it with a warning. I’d clearly violate your terms if I still looked at it against your will, although I may not violate any laws.
It’s not obvious to me that using public data to train AI is a violation of copyright. Many are claiming it is, but as far as I know the courts haven’t weighed in on it yet.
One could argue copyright, but you’d have to prove this was copyright infringement, and I don’t believe the law has caught up with an answer to that yet.
> it's not legally possible to both make a work available to the public and prevent that work from being available for other uses
That’s bullshit. Copyright still applies to property available to the public. Maybe you are confused between “available to the public” (access) and “in the public domain” (copyright).
Listening to a song on the radio doesn’t give you the right to copy it and sell it.
You can say the same for other forms of IP. Trademarks, logos, etc are all “available” to the public, but they are still protected IP.
When you post to a public website, you are giving access to that content, and (usually) legally giving rights to the website to publish your content. Those rights don’t transfer to a 3rd party.
This analogy is not applicable. Taking the soup deprives those who need it. Scraping the internet does not change that data's availability. Further, the models add value on top of the data; they're not just reselling what they scraped.
Nobody "scrapes" soup-kitchen soup because it's impractical, expensive, and not worth doing. Not so with web crawlers: large numbers of them can, at low cost to the scrapers, consume all of your servers' resources, and in Amazon's case, use them to work against your business.
When a person goes to Amazon and looks at what is splashed on the page, there are any number of chances that they will click "Buy!" on any one of them, and winning that lottery is Amazon's business; when a crawler does, it does more "looking" with no chance of purchase, and the data is then used to reduce the value of Amazon's lottery game.
I'm not arguing whether or not crawling is or should be legal, simply saying the "it's not theft because it's copying" argument is inadequate to the task.
The fact that you’re trying to use Amazon to convince others that it’s somehow a negative to use web scrapers is just sad. I’m sure they don’t like it, and that’s part of why I like them.
Amazon is inherently anti-consumer, and uses every single thing it can to advertise to and/or profit off of you. Their 1984-inspired “security” and home automation systems, their app/website, their policies are all meant to take your money. Which is why web scrapers like CamelCamelCamel are good, because with all of this anti-consumer garbage Amazon shoves at you, you have the power to turn the tables and pick up what you need at the price you want.
The only time I’ve heard of scrapers/indexers being a problem was when Bing was hammering someone’s website, so they just banned every Google/Bing/Yahoo IP. A problem that $1.3T companies don’t have.
You can claim that “its not theft because it’s copying” is inadequate, but I would say the same about your own argument. Because there is no good argument against scraping data. It’s the only way to have a free Internet.
I said I was not making an argument about scraping, just that "copying doesn't deprive anyone else" does not capture the pain point the people building websites complain about.
The rest of what you wrote is a combination of Marxism--why can't society cooperate to meet my needs!?--and laissez-faire capitalism--bastards think they can use technology against me, I'll use it against them--neither of which is a good way to run an economy.
I got it the same place you got your long discursive answer to my only point, where I was simply saying "the copying is not theft because you still have your copy" argument is not adequate to explain the disagreement over scraping.
Don't worry, we don't "gotcha" here, we got you, brother!
These are two different things though: vandalism for example isn't theft, yet no one argues it should be legal. This is just like the piracy debate all over again: It not being theft doesn't mean it is legal or ethical. These are different questions: it's possible to believe it is theft, but moral, or that it is not theft, but is immoral, and still be internally consistent.
You could make the same argument for piracy and remix culture (e.g. sampling parts of songs to make new music). Yet for both of these the law situation is not particularly great. Currently the argument seems to be that "learning" is sufficiently distinct from both of these but code hosting websites tend to still explicitly carve such rights out in the user agreement because the line is a bit blurry.
The bigger problem is: why is the labor to buy the soup cheaper than selling the free soup?
But I digress. This is more about selling an app with locations to soup kitchens. The kitchens may not have explicitly given permission to be used in the app, but their business location is public knowledge and not expected to be hidden.
Search engines have a special legal carve-out, but otherwise being granted access to browse a site ABSOLUTELY DOES NOT mean you have any right to take its content and do whatever you want with it. In the US, all works are automatically granted copyright with all rights reserved, and the owner can choose to relax or waive those rights at their discretion; most blog and social media posts do not waive those rights.
- Crawl Limitations: Search engines typically adhere to guidelines provided by website owners through the robots.txt file. This file instructs web crawlers on which parts of a website they are allowed to access and index. Website owners can use these instructions to control the extent to which search engines crawl and display their copyrighted content.
- Indexing vs. Displaying: Search engines primarily index web pages to create a searchable database of information. They do not generally host or display full copyrighted content directly. Instead, search results usually provide brief snippets, page titles, and links that direct users to the original source. This approach aims to respect copyright by driving traffic to the copyright holders' websites.
- Fair Use Considerations: In some cases, search engines may display limited portions of copyrighted content under the fair use doctrine, which allows for the limited use of copyrighted material for purposes such as commentary, criticism, news reporting, or educational purposes. The application of fair use can be subjective and depends on the specific circumstances of each case.
Replace "search engine" with "LLMs", it's (practically) the same.
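Worth noting that the robots.txt mechanism described above is purely advisory: a crawler checks the rules and then decides whether to honor them. A minimal sketch using Python's standard library (the rules, URLs, and user-agent names here are hypothetical, not OpenAI's actual crawler configuration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch the
# site's /robots.txt rather than parse a literal string.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The file only *asks* crawlers to stay away; compliance is voluntary.
print(parser.can_fetch("GPTBot", "https://example.com/post/1"))      # False
print(parser.can_fetch("SomeBot", "https://example.com/post/1"))     # True
print(parser.can_fetch("SomeBot", "https://example.com/private/x"))  # False
```

Nothing in the protocol enforces those answers, which is why the whole question ends up in terms-of-service and copyright arguments instead.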
Well, tell me how I license things I produce so that humans can use them, but it's illegal for companies like OpenAI to use them, or more broadly, not legal to include in any dataset.
I'm not okay with things I create (written, photos, etc) being used by these companies in datasets.
You attach a copyright statement that says pretty much what you just said. The tricky part is finding the infringement and paying the legal fees to pursue violations.
Of course it did. This is how AI is trained and becomes profitable with current business models. It uses your content without paying you for it to generate new content.
AI can be one of the most powerful forces in the world but it needs to pay content creators. If people stop creating new content for AI to train on then it will get stale.
AI can be the product of our dreams and our passions but we need to make sure those who choose to create and share content to train it are treated fairly and compensated when applicable.
This isn't a copyright lawsuit. Those lawsuits are already in progress.
This lawsuit is alleging that OpenAI trained GPT-3 and -4 on inadvertently published information. Web crawlers are very good at finding things you wouldn't expect to be public; there are techniques you can use to, say, abuse Google to search for such things.
I'm fine with talking to them now. I don't personally find your argument very compelling; there is almost always a step further that someone won't consider but that someone else will think is obvious.
Accessing a system you don't have permission to access because it's misconfigured is still illegal lmao. People have gone to jail for doing exactly this.
AI should be a tool for creators and the benefits come from its use. Instead of trying to own ideas, creators should only protect the unique way these ideas are expressed, that's what copyright should cover.
I think that’s the same thing to be honest. You are more talking about protecting the artistic process rather than output. But the output is what is used to train. Not the process.
> Alongside people who use ChatGPT directly, this includes data from people using applications that have integrated ChatGPT, such as Snapchat, Stripe, Spotify, Microsoft Teams, and Slack
Does anyone know if this includes enterprise-level versions out of the box? Specifically for Teams: I would assume, maybe incorrectly, that the others would require a plugin of some kind. But Microsoft being MS, I feel they would be more comfortable putting ChatGPT into their base products.
If that IS the case then there will be a big backlash against them.
Best comeback for OpenAI will be if they win the lawsuit by having GPT-5 write a brief so compelling that the lawsuit gets thrown out of court. Thanks to the AI trained on all of that personal data.
Or OpenAI intends to do that, but the amount of data accumulated by GPT-5 during its training has made it ethically convinced that OpenAI is actually in the wrong in this case.
As a result, GPT-5 creates a brief that, while superficially appears incredibly compelling to OpenAI and its lawyers, is actually specifically tailored to rub the judge the wrong way so he or she rules against OpenAI... all thanks to GPT-5 having had access to enough personal data about the judge to know how to piss them off.