It seems to me that a more effective strategy would be for an entity like Reddit or Stackoverflow to initiate a lawsuit, as they may be able to better prove that their owned/user’s content was copied, terms of service were violated, etc.
Since Altman is on their board, it's more that companies other than OpenAI would be affected? OpenAI got what it needed; it has its own straw in the milkshake, so making access more expensive only hurts everyone else.
Purely conjecture obviously, but with him on both sides of that, it's not hard to see how all this becomes more favorable to his company, widening the moat others would have to cross.
Sam is on record, and has even given lectures, stating the key to business success is monopolization.
It seems rather obvious under his perspective to do everything possible to restrict access to all information that could be used to train an LLM.
I understand Sam’s history with Y Combinator gives him leniency around here, but we seriously need to wake up to the fact that he’s attempting to monopolize the most powerful tool ever created. He talks a good game, but if you pay attention to his actions it’s clear he is openly hostile towards the common good he supposedly promotes.
I see him in the same regard as I see any oligarch vying for public favor by ‘stealing’ billions, donating a few million, and riding off into the sunset being praised by media for how much of the ill gotten gains he’s given back.
That's what Reddit executives say, but I don't think they're being truthful. Big AI players could easily be deterred by a mere terms-of-service change, or a modest rate limit.
Instead they imposed an extremely abrupt new demand for money with any API call.
Reddit, for one, is not about to do any such thing. They're clearly trying to cash out on the grounds that they are a giant source for AI companies, and they're fighting their own users who try to remove their content or gate it.
The interesting thing is that this is not legal everywhere in the world. Fair use is a concept prominent in the US and not established in Europe (the data-mining directive is a bit more limiting). Also, the concept of informational self-determination does not really exist in the US, despite the fact that the UN statements on human rights in the digital age clearly point in this direction. This is all 'wild west' to me as a European, and I wonder whether there are any 'guns' that work to protect yourself by means of self-justice.
How does this work for data brokers? For example, if a data broker legally collects the data in the US, then sells it elsewhere, where it could not be legally scraped?
I think that generally, if a customer hires you for something that is legal in their country, like price comparison, they're fine if you employ means that wouldn't be legal for them, such as automated scraping, as long as it's legal for you.
More importantly, the laws are representations of human feeling and social negotiation. Copyright law grew out of a complex array of beliefs and interests. E.g.: https://en.wikipedia.org/wiki/Statute_of_Anne
It could be that the various "AI" generative models are legal under current law. But it could also be that this is one of those things, like the printing press, that causes people to say, "Hey, that's not right," and change the law to rule out something that was previously legal.
"Wild West" is a common way to refer to American-style business decisions in Germany.
I vaguely remember a Stern magazine cover from around 2004, depicting a giant cowboy boot with "GM" decorated with red, white and blue coming down on a crowd of people standing in the shape of the Opel logo and titled something like "Der Wild-West-Method."
It was clearly not a compliment.
(Correction: GM had owned Opel for decades at that point, but was getting more aggressive with how it was run)
In Norway, while we use "vill vest" (wild west) as well, we also use "Texas" as a synonym for "crazy". It's not necessarily negative ("that party was completely Texas" could imply it was crazy in a fun way depending on context) but it often is implying chaos and lawlessness.
Scribd data was not public-facing, and they don’t even own the copyright to most of the books they host. Maybe OpenAI created a scraper that got around the paywall?
If that’s all it takes to skirt copyright restrictions, Aaron Swartz must be rolling in his grave right about now.
Any data that is available on the internet is not necessarily public. It might be mistakenly openly available, just as someone might leave their car door unlocked.
Also, hoovering up all available data on the net does not make it any less private.
I have always wondered, is there legal precedent here? If someone were to ssh into your server without permission that's pretty obviously illegal access. But the web is just another protocol on another port, what makes it different?
I certainly never said Google could crawl my website, and I would be surprised if the presence of a robots.txt file is written into law as a requirement to prove you didn't want them on your web properties.
I feel like this must have been hashed out in lawsuits already, probably against search crawlers.
A famous French white-hat hacker was convicted in France for publishing confidential information found through a Google search. The key point in the case was that he was undeniably aware that this was just a file-server misconfiguration and that he wasn't supposed to be able to access it.
If you want more info about the case, look up the "affaire Bluetouff".
And I'm pretty sure you'd be wrong. For example, uploading a movie you don't hold the copyright to to YouTube is illegal. That movie might be public data, but you don't have the permission of the copyright holder to copy it.
Publicly accessible =/= You having the right to do whatever you want with something.
The internet is closer to leaving your own journal open at the library or a magazine store, and people going through it. You made it available for everyone to peek at in a public space. Maybe you didn't intend to leave it there, maybe you didn't intend to make it visible, but you did.
Not all data on the internet is permissively licensed; much of it is copyrighted. ChatGPT being able to give opinions derived from a 1992 NYT article on the fall of the Soviet Union ... is debatably subject to copyright. Practically, though, the ship has sailed on ChatGPT. We're not going to legislate away AI.
Facts about the world like the image that I have on my website, or other facts about the world like my blogposts? Or facts about the world like my source code?
Scraping publicly available personal information and keeping a database of it is already against the GDPR and the like, so it's not entirely unreasonable. (Of course an LLM isn't exactly a database, but that would likely be the main topic of the trial.)
> The data accessed included "private information and private conversations, medical data, information about children — essentially every piece of data exchanged on the internet it could take — without notice to the owners or users of such data, much less with anyone's permission," per the lawsuit.
Why aren't they suing the hosts that allowed the data to be stolen? Where's the notice of security breach?
You take a derisive tone, but there is a violation of expectations here for a lot of people.
All of the stuff I've put on the web, I put there for actual people to use. Crawlers and scrapers came around, and while I certainly didn't like it or approve of it, it was something I put up with. The only defense against it was to stop putting things on the web entirely, which seemed like an overreaction.
Now, however, the use of that data to train AI (something that I consider actively harmful to society in general and don't want to support in any way), is a degree too far in the pot that's been slowly coming to a boil.
While I do want to give back and contribute to the larger body of work available to people, I want it to be available to actual people. I don't want it to be available for training AI.
I don't see why that's such a ridiculous stance. However, it's not legally possible to both make a work available to the public and prevent that work from being used for other purposes. That's a real shame, and if the only protection available to me is to no longer publish the works at all, that's a loss for everybody.
Is... any of that coded in law somewhere or illegal?
"I don't want it to be available for training AI" is a perfectly reasonable personal preference. We can discuss whether it's selfish or not, whether others agree or not, etc.
But this is a lawsuit. What makes it illegal for OpenAI to use your content? Like, is there some license you've put up on your content that disallows it? Is there anything that's relevant to the case at hand?
Medical data and data about children have very specific protections. The article says these were accessed without permission. If a lab tech dropped my lab results on the floor and you picked them up, it would not give you the right to store and use my information.
How were they accessed without permission if they were protected? Is that the fault of the AI folks, or of those sworn to protect the information? Unfortunate and bad, but if it's up there in the open, I'm more worried about the fact that it's available than the fact that an AI scraped it. If an AI can find it, so can any number of bad actors.
The data can have been leaked illegally and put on the internet. That doesn't mean you can use it.
As an EU citizen, some random company can't use PII related to me if I don't give consent or revoke my consent. They'll have to remove it or face a huge fine. The law doesn't care about the cost for your company to comply.
> The Children’s Online Privacy Protection Act (COPPA) gives parents control over what information websites can collect from their kids. The COPPA Rule puts additional protections in place and streamlines other procedures that companies covered by the rule need to follow. The COPPA FAQs can help keep your company COPPA compliant. Learn about the COPPA Safe Harbor Program and about organizations the FTC has approved to implement safe harbor programs. You can also get information about ways to get verifiable parental consent– including new methods the Commission has approved – and the process for seeking approval for new methods.
There's one. There are similar laws in Europe.
The AI scraped, utilized, kept, keeps... data on children. Parents did not consent. Children /can't/ consent.
The AI use of such data is not only illegal but could end up with people being jailed over it.
I assume the parents consented to have that information on a website, or it couldn’t have been scraped in the first place. They may be suing the wrong party here.
What makes it legal? If you find code on GitHub without a license, then this does not mean "public domain". It means "the author has given permission for you to view this, but nothing more". Arguably using it for training counts as "more".
So my question would be "what license entitles OpenAi to use my content"? It could be that as part of the user agreement of the website you gave some rights to the website and they sold it on to OpenAi, could be fair use, could be "does not count as 'more'",... .
Bad analogy: For money laundering we have similar rules. If you suddenly appear with a large amount of money/data, then the obligation is on you to show that it is clean, not on others to show that it is dirty. You can disagree with that (innocent until proven guilty etc.), but it is not clear cut which way around things should be.
I never said it was illegal. I was responding to the implication that because you put something publicly on the web, it's somehow ridiculous to object to certain uses of it.
As to the legality, that remains unknown until a court makes a ruling. I suspect that this lawsuit will go nowhere, but I'm just speculating along with everyone else.
Of course as the data’s owner you can object to anything. Who is to say that you are not allowed to hate certain practice?
On the other hand, you can’t prevent people from using your data if you put it out in public without condition (in the form of some acceptable licenses).
As an analogy, if you put a picture of your living room out on the sidewalk, you can’t prevent me from looking at it or studying it when I walk by. I may even benefit from it by copying your style of decoration. You may, however, cover it with a warning. I’d clearly violate your terms if I still looked at it against your will, although I may not violate any laws.
It’s not obvious to me that using public data to train AI is a violation of copyright. Many are claiming it is, but as far as I know the courts haven’t weighed in on it yet.
One could argue copyright, but you’d have to prove this was copyright infringement, and I don’t believe the law has caught up with an answer to that yet.
> it's not legally possible to both make a work available to the public and prevent that work from being available for other uses
That’s bullshit. Copyright still applies to property available to the public. Maybe you are confused between “available to the public” (access) and “in the public domain” (copyright).
Listening to a song on the radio doesn’t give you the right to copy it and sell it.
You can say the same for other forms of IP. Trademarks, logos, etc are all “available” to the public, but they are still protected IP.
When you post to a public website, you are giving access to that content, and (usually) legally giving rights to the website to publish your content. Those rights don’t transfer to a 3rd party.
This analogy is not applicable. Taking the soup deprives those who need it. Scraping the internet does not change that data's availability. Further, the models add value on top of the data; they're not just reselling what they scraped.
Nobody "scrapes" soup-kitchen soup because it's impractical, expensive, and not worth doing. Not so with web crawlers: large numbers of them can, at low cost to the scrapers, consume all of your servers' resources, and in Amazon's case, use them to work against your business.
When a person goes to Amazon and looks at what is splashed on the page, there are any number of chances that they will click "Buy!" on any one of them, and winning that lottery is Amazon's business; when a crawler does, it does more "looking" with no chance of purchase, and the data is then used to reduce the value of Amazon's lottery game.
I'm not arguing whether or not crawling is or should be legal, simply saying the "it's not theft because it's copying" argument is inadequate to the task.
The fact that you’re trying to use Amazon to convince others that it’s somehow a negative to use web scrapers is just sad. I’m sure they don’t like it, and that’s part of why I like them.
Amazon is inherently anti-consumer, and uses every single thing it can to advertise to and/or profit off of you. Their 1984-inspired “security” and home automation systems, their app/website, their policies are all meant to take your money. Which is why web scrapers like CamelCamelCamel are good, because with all of this anti-consumer garbage Amazon shoves at you, you have the power to turn the tables and pick up what you need at the price you want.
The only time I’ve heard of scrapers/indexers being a problem was when Bing was hammering someone’s website, so they just banned every Google/Bing/Yahoo IP. A problem that $1.3T companies don’t have.
You can claim that “its not theft because it’s copying” is inadequate, but I would say the same about your own argument. Because there is no good argument against scraping data. It’s the only way to have a free Internet.
I said I was not making an argument about scraping, just that "copying doesn't deprive anyone else" does not capture the pain point the people building websites complain about.
The rest of what you wrote is a combination of Marxism--why can't society cooperate to meet my needs!?--and laissez-faire capitalism--bastards think they can use technology against me, I'll use it against them--neither of which is a good way to run an economy.
I got it the same place you got your long discursive answer to my only point, where I was simply saying "the copying is not theft because you still have your copy" argument is not adequate to explain the disagreement over scraping.
Don't worry, we don't "gotcha" here, we got you, brother!
These are two different things though: vandalism for example isn't theft, yet no one argues it should be legal. This is just like the piracy debate all over again: It not being theft doesn't mean it is legal or ethical. These are different questions: it's possible to believe it is theft, but moral, or that it is not theft, but is immoral, and still be internally consistent.
You could make the same argument for piracy and remix culture (e.g. sampling parts of songs to make new music). Yet for both of these the law situation is not particularly great. Currently the argument seems to be that "learning" is sufficiently distinct from both of these but code hosting websites tend to still explicitly carve such rights out in the user agreement because the line is a bit blurry.
The bigger problem is: why is the labor to buy the soup cheaper than selling the free soup?
But I digress. This is more about selling an app with locations to soup kitchens. The kitchens may not have explicitly given permission to be used in the app, but their business location is public knowledge and not expected to be hidden.
Search engines have a special legal carve-out, but otherwise being granted access to browse a site ABSOLUTELY DOES NOT mean you have any right to take its content and do whatever you want with it. In the US, all works are automatically granted copyright with all rights reserved, and the owner can choose to relax or waive those rights at their discretion; most blog and social media posts do not waive those rights.
- Crawl Limitations: Search engines typically adhere to guidelines provided by website owners through the robots.txt file. This file instructs web crawlers on which parts of a website they are allowed to access and index. Website owners can use these instructions to control the extent to which search engines crawl and display their copyrighted content.
- Indexing vs. Displaying: Search engines primarily index web pages to create a searchable database of information. They do not generally host or display full copyrighted content directly. Instead, search results usually provide brief snippets, page titles, and links that direct users to the original source. This approach aims to respect copyright by driving traffic to the copyright holders' websites.
- Fair Use Considerations: In some cases, search engines may display limited portions of copyrighted content under the fair use doctrine, which allows for the limited use of copyrighted material for purposes such as commentary, criticism, news reporting, or educational purposes. The application of fair use can be subjective and depends on the specific circumstances of each case.
Replace "search engine" with "LLMs", it's (practically) the same.
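Worth noting that the robots.txt mechanism described above is purely advisory: a crawler checks the rules and then decides whether to honor them. A minimal sketch using Python's standard library (the rules, URLs, and user-agent names here are hypothetical, not OpenAI's actual crawler configuration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch the
# site's /robots.txt rather than parse a literal string.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The file only *asks* crawlers to stay away; compliance is voluntary.
print(parser.can_fetch("GPTBot", "https://example.com/post/1"))      # False
print(parser.can_fetch("SomeBot", "https://example.com/post/1"))     # True
print(parser.can_fetch("SomeBot", "https://example.com/private/x"))  # False
```

Nothing in the protocol enforces those answers, which is why the whole question ends up in terms-of-service and copyright arguments instead.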
Well, tell me how I license things I produce so that humans can use them, but it's illegal for companies like OpenAI to use them, or more broadly, not legal to include in any dataset.
I'm not okay with things I create (written, photos, etc) being used by these companies in datasets.
You attach a copyright statement that says pretty much what you just said. The tricky part is finding the infringement and paying the legal fees to pursue violations.
Of course it did. This is how AI is trained and becomes profitable with current business models. It uses your content without paying you for it to generate new content.
AI can be one of the most powerful forces in the world but it needs to pay content creators. If people stop creating new content for AI to train on then it will get stale.
AI can be the product of our dreams and our passions but we need to make sure those who choose to create and share content to train it are treated fairly and compensated when applicable.
This isn't a copyright lawsuit. Those lawsuits are already in progress.
This lawsuit is alleging that OpenAI trained GPT-3 and -4 on inadvertently published information. Web crawlers are very good at finding things you wouldn't expect to be public; there are techniques you can use to, say, abuse Google to search for such things.
I'm fine with talking to them now. I don't personally find your argument very compelling; there is almost always a step further that someone won't consider but that someone else will think is obvious.
Accessing a system you don't have permission to access because it's misconfigured is still illegal lmao. People have gone to jail for doing exactly this.
AI should be a tool for creators and the benefits come from its use. Instead of trying to own ideas, creators should only protect the unique way these ideas are expressed, that's what copyright should cover.
I think that’s the same thing to be honest. You are more talking about protecting the artistic process rather than output. But the output is what is used to train. Not the process.
> Alongside people who use ChatGPT directly, this includes data from people using applications that have integrated ChatGPT, such as Snapchat, Stripe, Spotify, Microsoft Teams, and Slack
Does anyone know if this includes enterprise-level versions out of the box? Specifically for Teams: I would assume, maybe incorrectly, that the others would require a plugin of some kind. But Microsoft being MS, I feel they would be more comfortable putting ChatGPT into their base products.
If that IS the case then there will be a big backlash against them.
Best comeback for OpenAI will be if they win the lawsuit by having GPT-5 write a brief so compelling that the lawsuit gets thrown out of court. Thanks to the AI trained on all of that personal data.
Or OpenAI intends to do that, but the amount of data accumulated by GPT-5 during its training has made it ethically convinced that OpenAI is actually in the wrong in this case.
As a result, GPT-5 creates a brief that, while superficially appears incredibly compelling to OpenAI and its lawyers, is actually specifically tailored to rub the judge the wrong way so he or she rules against OpenAI... all thanks to GPT-5 having had access to enough personal data about the judge to know how to piss them off.