OpenAI just admitted it can’t identify AI-generated text. That’s bad for the internet and it could be really bad for AI models.
In January, OpenAI launched a system for identifying AI-generated text. This month, the company scrapped it.

  • lily33@lemmy.world · 1 year ago

    Not really. If it’s truly impossible to tell the text apart, then it doesn’t really pose a problem for training AI. Otherwise, next-gen AI will be able to tell apart text generated by current-gen AI, and it will get filtered out. So only the most recent data will have unfiltered shitty AI-generated stuff, but they don’t train AI on super-recent text anyway.
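    A rough sketch of what that filtering step could look like, in Python. The detector object, its score method, and the 0.9 threshold are all hypothetical; the whole idea rests on some future detector actually being reliable.

    ```python
    # Hypothetical sketch: drop likely AI-generated documents from a scraped
    # corpus before training. The detector and threshold are invented here;
    # nothing like this is guaranteed to exist or work reliably.

    class DummyDetector:
        """Placeholder standing in for some future 'next-gen' AI-text detector."""
        def score(self, text: str) -> float:
            # Pretend probability that `text` is AI-generated.
            return 0.5

    def filter_corpus(documents, detector, threshold=0.9):
        """Keep only documents the detector considers unlikely to be AI-generated."""
        return [doc for doc in documents if detector.score(doc) < threshold]

    corpus = ["some scraped text", "another document"]
    clean = filter_corpus(corpus, DummyDetector())
    print(len(clean), "documents kept")
    ```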

    • Womble@lemmy.world · 1 year ago

      This is not the case. Model collapse is a studied phenomenon for LLMs, and it leads to deteriorating quality when models are trained on data that comes from their own output. It might not be an issue if there were thousands of models out there, but IIRC there are only 3-5 base models that all the others are derivatives of.
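      A toy sketch of the mechanism (not taken from the papers, just a minimal illustration): fit a simple model to data by counting, sample the next generation’s training set from that model, and repeat. Rare outcomes that miss a generation are gone for good, so the distribution keeps narrowing.

      ```python
      import random
      from collections import Counter

      # Toy illustration of model collapse: each generation is "trained"
      # (fit by counting) only on samples produced by the previous generation.
      random.seed(0)
      vocab = list(range(10))
      weights = [2 ** -k for k in vocab]            # long-tailed "real" distribution
      data = random.choices(vocab, weights, k=200)  # the original human-made data

      for gen in range(15):
          counts = Counter(data)
          print(f"gen {gen:2d}: distinct outcomes left = {len(counts)}")
          # sample the next training set purely from the current model's output
          outcomes = list(counts)
          freqs = [counts[v] for v in outcomes]
          data = random.choices(outcomes, freqs, k=200)
      ```

      The point is only that once a model sees nothing but its own output, whatever it already under-represents disappears entirely, which is roughly how quality deteriorates in the studies on real LLMs.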

      • volodymyr@lemmy.world · 1 year ago

        People still tap into the real world, while AI does not do that yet. Once AI is able to actively learn from real-world sensors, the problem might disappear, no?

        • vrighter@discuss.tchncs.de · 1 year ago

          They already do. Where do you think the training corpus comes from? The real world. It’s curated by humans and then fed to the ML system.

          Problem is that the real world now has a bunch of text generated by AI. And it has been well studied that feeding that back into training will destroy your model (because the networks would then effectively be trained to predict their own output, which just doesn’t make sense).

          So humans still need to filter that stuff out of the training corpus. But we can’t detect which ones are real and which ones are fake. And neither can a machine. So there’s no way to do this properly.

          The data almost always comes from the real world, except now the real world also contains “harmful” (to AI) data that we can’t figure out how to find and remove.
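          A back-of-the-envelope sketch of why even a decent detector wouldn’t fix this; every number below is invented purely for illustration.

          ```python
          # All numbers are made up; the point is only that whatever the
          # detector misses still ends up in the training corpus.
          ai_share = 0.30         # hypothetical share of scraped text that is AI-generated
          detector_recall = 0.80  # hypothetical fraction of AI text the detector catches

          removed = ai_share * detector_recall
          remaining_ai = ai_share - removed
          filtered_corpus = 1.0 - removed
          contamination = remaining_ai / filtered_corpus
          print(f"AI-generated share of the filtered corpus: {contamination:.1%}")  # ~7.9%
          ```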

          • volodymyr@lemmy.world · 1 year ago

            There are still people in between, building training data from their real-world experiences. Now the digital world may become overwhelmed with AI creations, so training on it may lead to model collapse. So what if we give AI access to cameras, microphones, all that, and even let it articulate them? It would also need to be adventurous, searching for spaces away from other AI work. There is lots of data out there which is not created by AI, although at some point that might become AI-made as well. I am leaving aside for the moment the obvious dangers of this approach.