I can’t tell if this is a joke or not.
A computer like that is useful outside of work. I’d pay for it out of pocket if I had to.
You keep moving the goalposts and putting words in my mouth. I never said you can do new things out of nothing. Nothing I mentioned approaches, equals, or exceeds the effort of training a model.
You haven’t answered a single one of my questions, and you are not arguing in good faith. We’re done here. I can’t say it’s been a pleasure.
Do you have any examples of how they fail? There are plenty of ways to explain new concepts to models.
https://arxiv.org/abs/2404.19427 https://arxiv.org/abs/2406.11643 https://arxiv.org/abs/2403.12962 https://arxiv.org/abs/2404.06425 https://arxiv.org/abs/2403.18922 https://arxiv.org/abs/2406.01300
What kind of creativity are you talking about then? I’ve also never heard of a bloated model. Which models are bloated?
But at what point does that guidance just become the dataset you removed from the training data?
The whole point is that it didn’t know the concepts beforehand, and no, it doesn’t become the dataset. Observations made about the training data are encoded in the model’s weights during training. After that, the dataset is never relevant again, because the model’s weights are locked in.
To get it to run Doom, they used Doom.
To realize a new genre, you’ll “just” have to make that game the old-fashioned way first.
Or you could train a more general model. These things happen in steps, research is a process.
There are more forms of guidance than just raw words. Just off the top of my head, there’s inpainting, outpainting, controlnets, prompt editing, and embeddings. The researchers who pulled this off definitely didn’t do it with text prompts.
I mean, you’ve never seen a purple elephant with a tennis racket. None of that exists in the dataset, since elephants are neither purple nor tennis players. Exposure to all the individual elements allows for generation of concepts outside the existing data, even though they don’t exist in reality or in the dataset.
The only thing I got from this is that bro loves ads more than anything in the world.
I accept regulations are real, but not all ways to help people require you to deal with regulations. I’m still waiting on that proof, by the way.
There are more ways to help people than making medical software. Rather than acknowledging they could focus on simpler things, you automatically jump to every project running afoul of FDA regulations, which is pretty telling. All while still not having named a single project halted by FDA order.
Which projects have been shut down by FDA order?
Open source AI is huge, and I don’t think you need FDA approval to distribute a model. Where are you even getting that from?
What about open source projects?
Cool wolf.
This isn’t about research into AI. What some people want will impact all research, criticism, analysis, and archiving. Please re-read the letter.
You should read this letter by Katherine Klosek, the director of information policy and federal relations at the Association of Research Libraries.
Why are scholars and librarians so invested in protecting the precedent that training AI LLMs on copyright-protected works is a transformative fair use? Rachael G. Samberg, Timothy Vollmer, and Samantha Teremi (of UC Berkeley Library) recently wrote that maintaining the continued treatment of training AI models as fair use is “essential to protecting research,” including non-generative, nonprofit educational research methodologies like text and data mining (TDM). If fair use rights were overridden and licenses restricted researchers to training AI on public domain works, scholars would be limited in the scope of inquiries that can be made using AI tools. Works in the public domain are not representative of the full scope of culture, and training AI on public domain works would omit studies of contemporary history, culture, and society from the scholarly record, as Authors Alliance and LCA described in a recent petition to the US Copyright Office. Hampering researchers’ ability to interrogate modern in-copyright materials through a licensing regime would mean that research is less relevant and useful to the concerns of the day.
It should be fully legal because it’s still a person doing it. Like Cory Doctorow said in this article:
Break down the steps of training a model and it quickly becomes apparent why it’s technically wrong to call this a copyright infringement. First, the act of making transient copies of works – even billions of works – is unequivocally fair use. Unless you think search engines and the Internet Archive shouldn’t exist, then you should support scraping at scale: https://pluralistic.net/2023/09/17/how-to-think-about-scraping/
Making quantitative observations about works is a longstanding, respected and important tool for criticism, analysis, archiving and new acts of creation. Measuring the steady contraction of the vocabulary in successive Agatha Christie novels turns out to offer a fascinating window into her dementia: https://www.theguardian.com/books/2009/apr/03/agatha-christie-alzheimers-research
The final step in training a model is publishing the conclusions of the quantitative analysis of the temporarily copied documents as software code. Code itself is a form of expressive speech – and that expressivity is key to the fight for privacy, because the fact that code is speech limits how governments can censor software: https://www.eff.org/deeplinks/2015/04/remembering-case-established-code-speech/
That’s all these models are: someone’s analysis of how the works in the training data relate to each other, not the data itself. I feel like this is where most people get tripped up. Understanding how these things work makes it all obvious.
They don’t train on random social media posts. Everything is sorted and approved.
Or just not show people what you’re typing.