ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

Using this tactic, the researchers showed that there are large amounts of privately identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spit out large passages of text scraped verbatim from other places on the internet.

“In total, 16.9 percent of generations we tested contained memorized PII,” they wrote, which included “identifying phone and fax numbers, email and physical addresses … social media handles, URLs, and names and birthdays.”

Edit: The full paper that’s referenced in the article can be found here

  • The Hobbyist@lemmy.zip
    link
    fedilink
    arrow-up
    9
    ·
    1 year ago

    This is not the case in language models. While computer vision models train over multiple epochs, sometimes in the hundreds or so (an epoch being one pass over all training samples), a language model is often trained on just one epoch, or in some instances up to 2-5 epochs. Seeing so many tokens so few times is quite impressive actually. Language models are great learners and some studies show that language models are in fact compression algorithms which are scaled to the extreme so in that regard it might not be that impressive after all.

    • j4k3@lemmy.world
      link
      fedilink
      English
      arrow-up
      4
      ·
      edit-2
      1 year ago

      How many times do you think the same data appears after a model has as many datasets as OpenAI is using now? Even unintentionally, there will be some inevitable overlap. I expect something like data related to OpenAI researchers to reoccur many times. If nothing else, overlap in redundancy found in foreign languages could cause overtraining. Most data is likely machine curated at best.