It would be interesting to have a Large Language Model (LLM) fine-tuned on the ProleWiki and leftist books, as it could be very useful for debunking arguments related to leftist ideology. However, local models are not yet doing search and cited sources, which makes it difficult to trust them. With citations, it is possible to check if the model is referencing what it is citing or just making things up. The inclusion of citations would enable users to verify the references and ensure the model is accurately representing the sources it cites. In the future, when a local platform with search capabilities for LLMs becomes available, it would be interesting to prioritize leftist sources in the search results. Collaboratively curating a list of reliable leftist sources could facilitate this process. What are your thoughts on this?

  • albigu@lemmygrad.ml
    link
    fedilink
    arrow-up
    11
    ·
    edit-2
    1 year ago

    I started crawling the Portuguese version of the MIA and there was definitely enough data there alone to fine tune one (about 30 GB but a lot of it was pdfs). Just be aware that it requires a lot of data preparation and each trial takes a while on civilian hardware, which is why I postponed it. If there’s interest here I’d be happy to collaborate.

    Also seems like a fun way to read a lot of diverse texts.

    Edit: but be aware that LLMs are inherently unreliable and we shouldn’t trust them blindly like libs trust chatGPT.

    • Alunya𝕏ers [she/her]@lemmygrad.ml
      link
      fedilink
      arrow-up
      8
      arrow-down
      1
      ·
      1 year ago

      Just be aware that it requires a lot of data preparation and each trial takes a while on civilian hardware, which is why I postponed it.

      How long is it? Just so you know, I’m a total newbie at LLM creation/AI/Neural Networking.

      Edit: but be aware that LLMs are inherently unreliable and we shouldn’t trust them blindly like libs trust chatGPT.

      So… if they’re inherently unreliable, why make them? Genuine question.

      • albigu@lemmygrad.ml
        link
        fedilink
        arrow-up
        10
        arrow-down
        1
        ·
        edit-2
        1 year ago

        How long is it? Just so you know, I’m a total newbie at LLM creation/AI/Neural Networking.

        I have never done a full run on an LLM training, but on the lab we used to have language models training for like 1-2 weeks making full use of 4 2080s IIRC. Fine tuning is generally faster than training so it’d be around a day or two on that hardware, but I don’t have access to it anymore (and don’t think it’d be ethical to use it for personal projects either way). On personal hardware I think it would again be back at the week mark. Since it’s an iterative process sometimes one might want to either have multiple training runs with different parameters in parallel or repeatedly train to try and solve issues. There are some cloud options with on-demand GPUs but then we’d need to be spending money.

        The bulk of the work is actually on making sure the data is appropriate and then validating if the model works correctly, which a lot of researchers tend to skimp on in their papers, and in practice is usually done by low paid interns or MTurk contractors.

        So… if they’re inherently unreliable, why make them? Genuine question.

        Cynical answer, stock exchange hype. Investors get really interested whenever we get human-looking stuff like text, voice, walking robots and synthetic flesh, even if those things have very little ROI on the technical side. Just look at people talking about society being managed by AI in the future despite most investment going into human-facing systems rather than logistics optimisations.

        The main issue (incredibly oversimplified) with ChatGPT is that due to it working on a text probability level, it can sometimes create some really convincing and human-sounding text, that is either completely false or contains subtle misrepresentations. It also has a lot of trouble providing accurate sources for what it says. Or it can mimic what appears like “human memory” by referring to previously said things, but that’s just emergent behaviour and you can easily “convince” it that it has said things it had not, for instance. Also the data can get so large that some stuff that shouldn’t be there can get in there as well, like how ChatGPT 3 is supposed to have a knowledge cut-off in September 2021, but it can sometimes answer questions about the war in Ukraine.

        ChatGPT can still be useful for bouncing ideas around, getting some broad overviews, text recommendations or creative writing experimentation. They’re also fun to dunk on if you’re bored on the bus. I think this would be a fun project, but if do it, we should always have a big red disclaimer that goes “this bot is dumb sometimes, actually read a book.”

        Here’s an example of how bad chatGPT is at sources. Bing has direct access to the internet and can sometimes fetch sources, but I’m not sure how that works and if it is feasible with our non-Microsoft-level resources.

        CW libshit