  • This article is full of errors!

    At its core, an LLM is a big (“large”) list of phrases and sentences

    Definitely not! An LLM is the combination of an architecture and its model parameters. It's just a bunch of numbers: no list of sentences, no database. (It seems the author has confused the LLM itself with its training dataset.)
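    A minimal sketch of what "architecture plus parameters" means (toy sizes, plain numpy; not any real model's code): the parameters are just arrays of floats, and the architecture is the function that does arithmetic on them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "LLM": the parameters are nothing but arrays of numbers --
# no phrases, no sentences, no database of documents.
vocab_size, d_model = 50, 8
params = {
    "embedding": rng.normal(size=(vocab_size, d_model)),
    "output": rng.normal(size=(d_model, vocab_size)),
}

# The "architecture" is just arithmetic on those numbers.
def forward(token_id, params):
    hidden = params["embedding"][token_id]   # look up a vector of floats
    return hidden @ params["output"]         # a score for every vocab token

total = sum(p.size for p in params.values())
print(f"{total} parameters, all plain floats")   # 800 for these toy sizes
```

    You could dump `params` to disk and search it all day: there is no text in there to find.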

    an LLM is a storage space (“database”) containing as many sample documents as possible

    Nope. This describes the training dataset, not the model. You could argue that memorization sometimes happens, so the model has some database-like features, but it isn't a database.

    Additional data (like the topic, mood, tone, source, or any number of other ways to categorize the documents) can be provided

    LLMs are trained in a self-supervised fashion: just sequences of tokens, with no human-provided labels or categories.
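    To illustrate (a hand-rolled sketch, not any library's API): the "labels" in language-model training are derived mechanically from the text itself, by shifting the token sequence one position.

```python
# Next-token prediction: the training target at each position is simply
# the following token, so no human annotation or categorization is needed.
tokens = [17, 4, 92, 8, 33]   # a raw token sequence from the corpus

inputs = tokens[:-1]          # what the model sees
targets = tokens[1:]          # what it must predict -- derived, not labeled

for x, y in zip(inputs, targets):
    print(f"given {x}, predict {y}")
```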

    Typically, an LLM will cover a single context, e.g. only social media

    I’m not aware of any LLM that does this. What’s the “context” of GPT-4?

    software developers have gone to great lengths to collect an unfathomable number of sample texts and meticulously categorize those samples in as many ways as possible

    The closest real thing is the RLHF process used to fine-tune an existing LLM for a specific application (like ChatGPT). The base LLM's dataset is not annotated or categorized in any way.

    a GPT uses the words and proximity data stored in LLMs

    This is confusing: "GPT" (generative pre-trained transformer) names the architecture of the LLM, not a separate program that "uses" one.

    it is impossible for it to create something never seen before

    This isn't accurate. Depending on the temperature setting, an LLM can output literally any token at any time with non-zero probability, so it can absolutely produce things it hasn't seen.
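    A quick sketch of temperature sampling (hand-rolled softmax, illustrative numbers): after dividing the logits by the temperature, the softmax assigns every token a strictly positive probability, so any token can in principle be emitted.

```python
import math
import random

def sample(logits, temperature=1.0):
    """Sample a token index from temperature-scaled softmax probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0], probs

logits = [5.0, 1.0, -3.0]                     # model strongly prefers token 0
_, probs = sample(logits, temperature=1.0)
assert all(p > 0 for p in probs)              # even token 2 can be chosen
```

    Raising the temperature flattens the distribution, but even at temperature 1 every probability here is already above zero.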

    Also I think it’s too simple to just assert that LLMs are not intelligent. It mostly depends on your definition of intelligence and there are lots of philosophical discussions to be had (see also the AI effect).


  • Whether something is derivative or not is one of the key questions used to determine whether the free use of someone else’s copyrighted work is fair, as in fair use.

    I think training an AI model is not fair use. Either the result is a derivative work and needs a license, or it isn't and the material can be used without one. In neither case is it fair use (in the legal sense of "fair use").

    I’m not sure if you’re making an argument about what the law currently says or what it should say. In my opinion the law should be updated to clarify if you need a license to use copyrighted material as training data.

    The amount that artists would be paid would be determined by negotiation between the artist (the rights holder) and the entity using their work

    Sure, but my point is that such an agreement will never be made. Using the data for free is a good deal for AI companies; if they can't do that, they simply won't be interested.

    Either way, I think there is no way for artists to win this. It’s completely possible to train large image generators without copyrighted material. These datasets are so large that paying artists per image will never be feasible.