Swiss government releases fully FOSS LLM under Apache 2.0, argues for AI as a public utility

Cooper8@feddit.online · 7 months ago

frongt@lemmy.zip · 7 months ago

> Apertus was developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act. Particular attention has been paid to data integrity and ethical standards: the training corpus builds only on data which is publicly available. It is filtered to respect machine-readable opt-out requests from websites, even retroactively, and to remove personal data, and other undesired content before training begins.

Available doesn’t mean licensed for AI training.

schnurrito@discuss.tchncs.de · 7 months ago

And yet it is still a legally unsettled question whether LLM training requires a copyright license at all. In my opinion, no one should want that to be the case: why would people on the Internet want to argue for an expansion of copyright law?

Fedizen@lemmy.world · 7 months ago

Saying an expensive product that requires servers to run is the only thing exempt from copyright is just handing a bunch of giant corporations a get out of jail free card.

Either reform copyright so more things are public domain or require AI companies to pursue licenses to training data.

Giving an unfair exemption to copyright laws solely to giant tech companies is just another corporate handout.

finalarbiter@lemmy.dbzer0.com · 7 months ago

What I want is consistency: either apply the law equally and fairly or reform the whole system. Nobody, especially not big business, should be getting special carve-outs to be exempt from copyright infringement outside of ‘fair use’ considerations.

In my ideal world, IP law would be framed to protect novel ideas just long enough for inventors or creators to capitalize on their ideas and prevent outright 1:1 copying without any sort of innovative or transformational changes. It would also discourage squatting on things like patents: patent squatting and the like should lead to losing the rights.

Cethin@lemmy.zip · 7 months ago

As with all things, nuance and context are required. I don’t think we should be taxing poor people that heavily (if at all), but does that mean I should be against taxing the ultra-wealthy more? Obviously not.

I support copyright to protect developers and not hinder users, hobbyists, or the average person. I don’t support it to only help massive companies who can manipulate the law to protect them from competition, but also not hinder them from stealing from the masses. They can afford to pay. If AI is actually as valuable as they say, the price of paying for the training data is trivial.

Copyright shouldn’t only be helpful to big businesses. It should be most helpful to the average person. We have the opposite here. I support modifying copyright law to bind big businesses and liberate individuals. I don’t need to be totally against it like you imply.

chicken@lemmy.dbzer0.com · 7 months ago

But we can’t afford to pay. I don’t think open models like the one in the OP article would be developed and released for free to the public if there was a complex process of paying billions of dollars to rightsholders in order to do so. That sort of model would favor a monopoly of centralized services run only by the biggest companies.

Cethin@lemmy.zip · 7 months ago

The model should take into account income. For an open-source model it should be free. It’s using public data to produce a public product. For a for-profit model it should be paid. If they’re profiting off of public data then they should have to pay for the right to use it.

We can’t afford to make any of this. We don’t have the money for the compute required or to pay for the lawyers to make the law work for us. It should benefit the people, so it needs to change. It needs to be “expanded” (I wouldn’t call it that, rather “modified” but I’ll use your word) in that it currently only protects the wealthy and binds the poor. It should be the opposite.

chicken@lemmy.dbzer0.com · 7 months ago

> We can’t afford to make any of this. We don’t have the money for the compute required or to pay for the lawyers to make the law work for us.

I don’t think this is entirely true. Yeah, large foundational models have training costs that are beyond the reach of individuals, but plenty can be done that is within reach, or that a relatively small organization can do. I can’t find a direct price estimate for Apertus, and it looks like they used their own hardware, but it’s mentioned they used ten million GPU hours on GH200 GPUs. I found a source online claiming a rental cost of $1.50 per hour for that hardware, so I think the cost of training this could be loosely estimated at something around 15 to 20 million dollars.
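For anyone who wants to check the arithmetic, here is a minimal sketch of that estimate in Python. Both inputs are the figures quoted in the comment (ten million GPU hours, $1.50 per GPU hour); the rental price is the commenter's third-party number, not an official one, and the function name is made up for illustration. It ignores staff, storage, networking, and failed runs, so treat it as a floor rather than a real budget.

```python
# Back-of-the-envelope rental cost from the figures quoted in the thread.
GPU_HOURS = 10_000_000           # "ten million GPU hours" as cited in the comment
USD_PER_GPU_HOUR = 1.50          # ASSUMPTION: rental price the commenter found online


def rental_cost_usd(gpu_hours: float, usd_per_hour: float) -> float:
    """Cost of renting the same amount of GPU time at a flat hourly rate."""
    return gpu_hours * usd_per_hour


estimate = rental_cost_usd(GPU_HOURS, USD_PER_GPU_HOUR)
print(f"${estimate / 1e6:.0f} million")
```

At exactly those inputs the product is 15 million dollars, which is in the same ballpark as the loose figure given in the comment.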

That is a lot of money if you are one person, but it’s an order of magnitude smaller than the settlements of billions of dollars being paid so far by the biggest AI companies for their hasty unauthorized use of copyrighted materials. It’s easy to see how copyright and legal costs could potentially be the bottleneck here preventing smaller actors from participating.

> It should benefit the people, so it needs to change. It needs to be “expanded” (I wouldn’t call it that, rather “modified” but I’ll use your word) in that it currently only protects the wealthy and binds the poor. It should be the opposite.

How would that even work though? Yes, copyright currently favors the wealthy, but that’s because the whole concept of applying property rights to ideas inherently favors the wealthy. I can’t imagine how it could be the opposite even in theory, but in practice, it seems clear that any legislation codifying limitations on use and compensation for AI training will be drafted by lobbyists of large corporate rightsholders, at the obvious expense of everyone with an interest in free public ownership and use of AI technology.

partofthevoice@lemmy.zip · edit-2 7 months ago

Sadly, we’ll most likely see an influx of regulation right when it’s broadly accessible to the general public to run locally.

Cethin@lemmy.zip · 7 months ago

Yeah, most likely, and it’ll only bind users and protect the businesses, as always.

It already is broadly accessible to the general public. They just don’t know about it or just accept using one of the cloud versions. It’s trivial to get up and running at this point.

partofthevoice@lemmy.zip · 7 months ago

That’s news to me, unless you’re only referring to the smaller models. Any chance you can run a model that exceeds your RAM capacity yet?

Cethin@lemmy.zip · 7 months ago

This is probably the easiest tool I’ve used to run them: https://lmstudio.ai/

There’s tons of models available here, some of them fairly large: https://huggingface.co/

No, I’m pretty sure there’s no way to run anything larger than your RAM/VRAM, at least not automatically. You can use storage as RAM, but that’s probably not a good idea. It’s orders of magnitude slower. You’re better off running a smaller model.

partofthevoice@lemmy.zip · 7 months ago

I’m not knowledgeable in this area, but I wish there was a way to partition the model and stream the partitions over the input, allowing for some kind of serial processing of models that do exceed memory. Like if I could allocate 32 GB of RAM and process a 500 GB model, but at a (500/32) roughly 15x slower rate.

m532@lemmygrad.ml · 7 months ago

It would need to load every part of the model from disk into RAM for every token it generates. This would take ages.

What you can do, however, is quantize the model. If you, for example, quantize a 16-bit model into 4-bit, its storage and RAM requirements will go down to 1/4. While the calculations will still be in 16-bit, the weights will lose some accuracy.
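A rough sketch of both halves of that claim, in Python. The first function just counts bits, which is where the 1/4 comes from. The second is a deliberately simple 4-bit scheme (one shared scale factor, integer codes from -7 to 7) to show where the accuracy loss comes from. Everything here is an assumption for illustration: the 8-billion-parameter size, the sample weights, and the function names are made up, and this is not the scheme any particular tool uses. Real formats quantize in small blocks and store a scale per block, so their footprint lands a bit above the raw 1/4.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Raw storage for the weights alone, ignoring any per-block metadata."""
    return num_params * bits_per_weight / 8


def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in [-7, 7] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Turn the integer codes back into approximate floats."""
    return [c * scale for c in codes]


PARAMS = 8_000_000_000  # ASSUMPTION: hypothetical 8B-parameter model
fp16_bytes = weight_bytes(PARAMS, 16)
q4_bytes = weight_bytes(PARAMS, 4)
print(f"16-bit: {fp16_bytes / 1e9:.0f} GB, 4-bit: {q4_bytes / 1e9:.0f} GB")

weights = [0.70, -0.35, 0.10, 0.02]  # made-up example weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(codes, [round(e, 3) for e in errors])
```

The rounding error per weight is at most half the scale factor, which is why small weights next to one large weight lose the most relative precision.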

Cethin@lemmy.zip · 7 months ago

The way that could be done would be significantly worse than 15x slower. That’s the issue. Even with the fastest storage, moving things between RAM and storage creates massive bottlenecks.

There are ways to reduce this overhead by intelligently timing moving pieces between storage and RAM, but storage is slow. I don’t know enough about how the models work to say whether it is possible to know what will be needed soon, so that you can start moving it into RAM before it’s needed. If that can be done then it wouldn’t be impossibly bad, but if it can’t then we’re talking something like 100x slower maybe. Most of these are already pretty slow on consumer hardware, so that’d be effectively unusable. You’d be waiting hours for responses.
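To put rough numbers on "waiting hours", here is a small Python sketch of the streaming idea from this subthread. The 500 GB model and 32 GB of RAM come from the comments above; the disk speed (5 GB/s, roughly a fast NVMe drive reading sequentially) and the 200-token reply length are assumptions picked for illustration. It also assumes a dense model where every weight is read once per generated token, and it pessimistically ignores that about 32 GB could stay resident in RAM, as well as any clever prefetching.

```python
import math

MODEL_GB = 500.0        # model size from the thread
RAM_GB = 32.0           # RAM budget from the thread
DISK_GB_PER_S = 5.0     # ASSUMPTION: sequential read speed of a fast NVMe SSD
REPLY_TOKENS = 200      # ASSUMPTION: length of a typical reply


def partitions_needed(model_gb: float, ram_gb: float) -> int:
    """Number of RAM-sized chunks the model has to be split into."""
    return math.ceil(model_gb / ram_gb)


def seconds_per_token(model_gb: float, disk_gb_per_s: float) -> float:
    """Time to stream every weight from disk once, which happens per token here."""
    return model_gb / disk_gb_per_s


chunks = partitions_needed(MODEL_GB, RAM_GB)
per_token = seconds_per_token(MODEL_GB, DISK_GB_PER_S)
reply_hours = per_token * REPLY_TOKENS / 3600

print(f"{chunks} chunks, {per_token:.0f} s per token, {reply_hours:.1f} h per reply")
```

Under these assumptions the disk read alone costs about 100 seconds per token, so a single reply takes several hours before any compute time is counted. Changing `DISK_GB_PER_S` shows how little a faster drive helps.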

frongt@lemmy.zip · 7 months ago

Why would it be an expansion? If you’re using someone else’s work, why wouldn’t you need a license? If I write a book and publish it under CC-BY-NC, should Google be allowed to take my work for their commercial product without compensation or even attribution? Should Microsoft be allowed to create closed-source commercial Copilot off GPL source code?

schnurrito@discuss.tchncs.de · 7 months ago

It’s an expansion to say that LLM training constitutes a derivative work. You are of course entitled to your opinion that it should be the case; all I can say to that is that in the 2000s and 2010s nearly everyone on the Internet tended to argue for more limitations, not further expansions, of copyright law, and I wonder what happened to that attitude.

frongt@lemmy.zip · edit-2 7 months ago

Well, this being the open source community, I would expect most people here to be on the side of respecting the rights of content creators. Like I said, if I write some GPL software, I don’t think Microsoft should be able to disrespect my license just because they’re also disrespecting everyone else’s license through automation at scale.

Edit: forgot to mention, since their product is wholly dependent on the other works, that’s the very definition of a derivative work. While you could argue it’s transformative, it certainly fails the other tests for fair use.

General_Effort@lemmy.world · 7 months ago

I find it very unexpected. It used to be understood that IP laws favor monopolies. E.g., I don’t remember the OS community being on the side of Oracle in https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.

Maybe it just passed me by.

m532@lemmygrad.ml · edit-2 7 months ago

GPL was made to destroy the oppressive bourgeois copyright system from the inside, not to expand copyright even further.

If the corpos ignore the GPL, they delegitimize their own copyright system.

BaroqueInMind@piefed.social · 7 months ago

The BSD license allows for this and BSD still thrives (the PS5 OS, Apple iOS and macOS [Darwin], TrueNAS, OPNsense, and several enterprise-level commercial router operating systems use it and contribute significant code back into the BSD projects to ensure CVE safety). I’m not agreeing with it, just providing an alternate perspective.

JackbyDev@programming.dev · 7 months ago

“I didn’t steal and distribute your work, I just made a machine that distills it down and is able to copy everything meaningful about it!”

m532@lemmygrad.ml · 7 months ago

To steal data, it needs to be deleted off the target’s computer. DMCA is stealing. Copying is not.

m532@lemmygrad.ml · 7 months ago

It’s already settled. AI training is considered “fair use”, therefore compliant with copyright.

Still, death to copyright.

benagain@lemmy.ml · 7 months ago

“Your honor, my archive of Linux ISOs was acquired under the pretense that they were ‘publicly available’ and the copyright holders didn’t ‘opt-out’ using the ‘up-for-grabs.txt’ standard I invented.”

exu@feditown.com · 7 months ago

Still much better, especially with respecting opt-outs, than most other LLMs

Pennomi@lemmy.world · 7 months ago

Legally, it seems it does, at least in the US and EU. I assume China too.

Whether or not it should is a different argument, but copyright is a legal framework, not an ethical one.

m532@lemmygrad.ml · 7 months ago

You don’t need a license to look at stuff.

Apertus: a fully open, transparent, multilingual language model