Although it’s theoretically possible someone could train a language model on Reddit alone, I’m not aware of any companies or researchers who have. The closest equivalent may be Stable LM, a language model that was panned for producing incoherent output and mocked by some for using Reddit for something like 50-60% of its dataset, tho it was also made clear that its training process was a mess in general.
How a language model talks and what it can talk about is an issue with some awareness already, though the actions taken so far, at least in the US context, are about what you would expect. OpenAI, one of the only companies with enough money to train its own models from scratch and one of the most influential, having brought language models into public view with ChatGPT, took a pretty clearly “decorum liberal” stance on it: tuning their model’s output over time to make it as difficult as possible for it to say anything that might look bad in a news article, with the end result being a model that sounds like it’s wearing formal clothing at a dinner party and is about to lecture you. And also unsurprisingly, part of this process was capitalism striking again, with OpenAI traumatizing underpaid Kenyan workers through a third-party company to help filter unwanted output out of the language model: https://www.vice.com/en/article/wxn3kw/openai-used-kenyan-workers-making-dollar2-an-hour-to-filter-traumatic-content-from-chatgpt
Though I’m not familiar enough with the details at other companies, most other language models trained from scratch have followed in OpenAI’s footsteps, choosing “liberal decorum” style tuning efforts and calling it “safety and ethics.”
My knowledge of alignment (efforts to understand what exactly a language model is learning, why it’s learning it, and how that positions it in relation to human goals and intentions) is also limited. But from what I’ve seen, the most basic level of “trying to make sure output does not steer toward downright creepy things” comes down to careful curation of the dataset and lots of testing at checkpoints along the way. A dataset like this could include Reddit, but it would likely be a limited part of it, and as far as I can tell, what matters more than where you get the data is how the different elements in the dataset balance out: you include stuff that is repulsive, stuff that is idyllic, and everything in between, and you try to balance it so the model doesn’t trend toward the repulsive stuff but is still capable of understanding it and learning from it (kind of like a human).
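To make the “balancing” part a bit more concrete, here is a minimal Python sketch of what weighted dataset mixing plus a checkpoint check might look like. The source names, the mixing weights, and the objectionable-content check are all hypothetical placeholders I made up for illustration, not numbers or tools from any real training pipeline.

```python
import random

# Minimal sketch of the "balance the dataset, then test at checkpoints" idea.
# All source names, weights, and the objectionable-content check below are
# hypothetical placeholders, not anything a real lab has published.
SOURCE_WEIGHTS = {
    "books": 0.35,
    "web_crawl": 0.30,
    "wikipedia": 0.15,
    "forums_incl_reddit": 0.15,  # forum text kept as a minority slice
    "flagged_but_kept": 0.05,    # small dose of "repulsive" material so the model can still learn from it
}


def sample_training_batch(corpora, batch_size):
    """Draw a batch whose composition follows the chosen mixing weights,
    rather than whatever ratio the raw scraped data happens to have."""
    sources = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[s] for s in sources]
    batch = []
    for _ in range(batch_size):
        source = random.choices(sources, weights=weights, k=1)[0]
        batch.append(random.choice(corpora[source]))
    return batch


def checkpoint_eval(generate, prompts, is_objectionable):
    """Rough checkpoint test: what fraction of sampled outputs trip the
    (placeholder) objectionable-content check."""
    outputs = [generate(p) for p in prompts]
    return sum(1 for o in outputs if is_objectionable(o)) / max(len(outputs), 1)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    corpora = {name: [f"example text from {name} #{i}" for i in range(100)]
               for name in SOURCE_WEIGHTS}
    batch = sample_training_batch(corpora, batch_size=8)
    rate = checkpoint_eval(lambda p: p.upper(), batch, lambda o: "FLAGGED" in o)
    print(f"objectionable rate at this checkpoint: {rate:.2%}")
```

In practice the generate function would be the partially trained model at a given checkpoint and the check would be a trained classifier or human review rather than a toy lambda, but the shape of the loop is the same: mix the data deliberately, then keep measuring what the model actually produces as training goes on.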
None of this tackles the deeper question of cultural bias in a dataset, which is its own can of worms, but I’m not sure how much can be done about that as long as training for a specific language means including a ton of data that is rife with that language’s cultural biases. It may be a bit cyclical in this way, in practice, but to what extent is difficult to say, given how extensive a dataset may be and how the people who create it choose to balance things out.
Edit: mixed up “unpaid” and “underpaid” initially; a notable difference, tho still bad either way