Microsoft Kosmos-1: A Multimodal Large Language Model (github.com/microsoft)
228 points by solarist on March 1, 2023 | hide | past | favorite | 104 comments


The examples in the paper are pretty impressive. There is an example of a Windows 11 dialog image. The computer can figure out which button to press given the desired outcome of the user. If one were to take this model and scale it, I can see an advanced bot in <5 years navigating the web and doing work purely by visual means, based on a human's text input. Interesting times.


I've been following tech long enough to know that as soon as the computer can figure out which button to press it's only going to click on ads, I guarantee it.


Isn't it trivial to make a computer click on ads though? Just run Selenium, apply the filtering rules from an ad blocker, and then click on a random element that would have been blocked.
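
A rough sketch of what that could look like (the CSS selectors here are simplified stand-ins for real filter-list rules, and example.com is just a placeholder):

    import random
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Simplified stand-ins for ad-block filter rules (real lists are far larger).
    AD_SELECTORS = [".ad", ".adsbygoogle", "[id^='google_ads']", "iframe[src*='doubleclick']"]

    driver = webdriver.Chrome()
    driver.get("https://example.com")

    # Collect everything an ad blocker would have hidden, then click one at random.
    candidates = []
    for selector in AD_SELECTORS:
        candidates.extend(driver.find_elements(By.CSS_SELECTOR, selector))
    if candidates:
        random.choice(candidates).click()
    driver.quit()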


I think their point is the opposite - it's *not* trivial to make the computer click the correct "Download Now!" button to get Minecraft versus the other 4 that lead to malware.


Isn't it? Just run it on a machine with ublock origin.


And then that's going to be met with MS making it "impossible" for bots to automate clicking on ads, which will have the unintended consequence of making things harder for power users.


That explains Google's UX-understanding AI. It can describe in words what the next step on a form is.


You might be interested in this: https://www.adept.ai/blog/act-1


We can't stop it, but giving an AI unbridled access to the Internet is a terrible idea. Whether it's a misphrased question or a clever prompt hack, entire sites will be crushed by its sheer superhuman performance.

Hacker News will just be robots chatting to each other, nudging toward the latest Product Hunt launch.


We will develop a web of trust [1]. I assign weighted(!) trust to my friends, they assign it to theirs. Small world [2]. This also fixes fake reviews, dependency security, politics, and a whole lot of other things.

The idea is too good not to happen. I repeat it from time to time on HN. I'm currently not in a position to implement it (not sure if I ever will be; it is hard). I just hope it's created in some decentralized form before some corporation does it; when controlled by a single entity it is useless. A toy sketch of the propagation idea follows below the links.

1. http://comboy.pl/wot.html (not really worth your time; written long ago and too verbose, but feel free)

2. https://en.wikipedia.org/wiki/Six_degrees_of_separation
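
Roughly the kind of computation I mean (my own toy simplification, not what the write-up above describes): trust in a stranger is the best friend-of-a-friend path, multiplying edge weights along the way.

    def trust(graph, source, target, depth=3):
        """graph: {person: {friend: weight in [0, 1]}}"""
        if source == target:
            return 1.0
        if depth == 0:
            return 0.0
        best = 0.0
        for friend, weight in graph.get(source, {}).items():
            best = max(best, weight * trust(graph, friend, target, depth - 1))
        return best

    web = {"me": {"alice": 0.9, "bob": 0.4},
           "alice": {"carol": 0.8},
           "bob": {"carol": 0.9}}
    print(trust(web, "me", "carol"))  # 0.72 via alice beats 0.36 via bob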


> This also fixes fake reviews, dependency security, politics, and a whole lot of other things.

All you have to do is look up some of the batshit-crazy things people in your social circle already share on Facebook (or LinkedIn) to know this won't solve most of those problems.

I may trust my friend to thoroughly vet information on Disc Golf, but they may be out of their element when it comes to "Revolutionary cold fusion breakthrough", which I may get through them if there's a single generic weight for trust.


Good observation. I thought at first that it should be multidimensional, but categories are hard. Very hard.

I think general trust could work. People I trust won't give strong opinions about things they have no idea about. You choose these people and you assign weights. It's not a random family circle.

I have some vague ideas about wallet of personalities which would serve the same purpose as categories of trust.


I've idly and very-casually thought about this for a long time, ever since decentralized filesharing really. If you ever do feel that you're in a position to take a stab at it, I'd be interested as well.


https://www.youtube.com/watch?v=efPrtcLdcdM

The internet will be saturated with fake people soon. Someone already did a PoC of this on 4chan as a joke with just a small GPT2 model. In a few years you won't be able to tell if you're talking to a human unless they're physically in front of you.


Holy tckin' cow!!! Is this real?

Captchas be damned now. Beating AI with AI. What a time to be alive.


This seems like the future to me: a huge chunk of work will be able to be done by just talking to your computer and then automating the task. Society is really going to need to adapt; knowledge workers being replaced will be as big a change as the industrial revolution replacing many manual laborers.

What's interesting is how these systems will be maintained when all the junior-tier engineering work is replaced by AI. Companies don't like hiring junior engineers now, and there will be an even bigger gap before a junior engineer becomes net productive. Plus, people will be building stuff using AI without understanding how things work under the hood. Seems ripe for some 40K-tier situation where we have tech priests running systems that nobody knows how to build from scratch anymore.


Talking to your computer? That will not last long before another disruptive change happens. How about computers taking over the decisions you think you are freely able to make right now? It is all moving fast, accelerating actually.


I worked at a company a few years back that used YOLO trained on billions of UI screenshots to navigate the UI of any desktop application based on plain-English instructions. It already exists.


Is there a better page to link to? I cannot even see "Kosmos" on this page!

Edit: Ah, looks like this is the link to the paper: https://arxiv.org/abs/2302.14045

It was discussed yesterday: https://news.ycombinator.com/item?id=34965326


A better link would have been the tweet; it includes the paper & GH repo: https://twitter.com/alphasignalai/status/1630651280019292161


It can even solve IQ tests...I mean, how much further are we moving the goal post?

Is there a model that can solve differential equations symbolically and numerically? Most of modern engineering just boils down to diff.eqs whether ordinary or partial. It's our current best method to reason about stuff and control them.


> It can even solve IQ tests...I mean, how much further are we moving the goal post?

The problem with tests like this is that when trained on the existing big datasets (Common Crawl etc.), chances are the test is already in the input, so the validation is not proper. It's the same thing with all the "AI beats SAT" headlines. The exercises for those very tests exist all over the internet already.


It's well documented that these models can solve variations of questions that are not found anywhere in their training set, and even entirely novel problems invented by prompters. Not with 100% success, but they can do it at a rate far better than chance, so the idea that they're simply pulling responses from their training data is not correct.


Ok, maybe I should rephrase, but think of it like this: would your IQ test score be accurate if you had a week beforehand to train on solving IQ tests?

You don't get exactly the same test in the end, similar with the SAT, but the constraints we put on these tests (they have to be comparable) produce patterns in the questions you can train for. This is the same logic behind why people can train to improve their SAT scores; if those scores were a measure of true innate intelligence, training would have no impact on them.


If we flip the question around it seems more reasonable: would your IQ test score be accurate if you did not train beforehand on practice IQ tests?

I would presume not - most tests are timed, and if you are spending time on first-time only tasks in understanding the problem then the result is inaccurate. If you train out those first-time tasks so that you are repeatably using the time budget in the test to solve problems then you should reach some kind of steady state and produce repeatable and more accurate test scores.

My take is that the repeatable scores measuring your steady state in the task would be more accurate than the untrained scores with an unknown amount of initialization time within each problem. I would make a similar claim to naasking below that this could account for some of the Flynn effect.


But hasn't that always been a problem inherent to IQ tests (and SAT tests, of course)?


It is definitely one of the problems you add to a pile marked "IQ tests are meaningless and cater to the worst conceptions of human intelligence."


I mean, let's be fair. That's also how many humans learn and do IQ tests.


Yes but the critical difference is that AI couldn’t pass these tests if they weren’t trained on a very similar set of questions and answers.

Not every person takes a SAT prep class to improve their test score. There are lots of people who are truly above average in terms of intelligence and can score very high on the first try.


> Yes but the critical difference is that AI couldn’t pass these tests if they weren’t trained on a very similar set of questions and answers.

First, I don't think that's strictly true. Obviously they wouldn't do as well but they still do better than chance.

Second, there's evidence that this is a big part of the Flynn effect, which means humans are susceptible to a similar phenomenon.


“ Obviously they wouldn't do as well but they still do better than chance” is a huge understatement. If a model wasn’t trained with any SAT questions and answers but was trained with the same verbal and mathematical knowledge a high school student would have, the AI would do extremely poorly in an actual test. In contrast, the vast majority of human test takers would score leagues above picking answers by chance.

Your original reply insinuated that the AI is learning very similarly to how humans do, and that's just not true. Yes, humans do pattern matching based on prior experiences/knowledge like AI does when you train a model, but human intelligence goes way beyond that.


> Yes, humans do pattern matching based on prior experiences/knowledge like AI does when you train a model, but human intelligence goes way beyond that.

Humans are trained on orders of magnitude more multimodal data over their lifetimes. Also, humans are not born as an unbiased model; billions of years of evolution have crafted many implicit biases into our cognition (like a propensity for language, facial recognition, etc.). All machine learning models are true blank slates, so it takes a lot more data just to build up to the same starting point as a newborn human.

All that's to say that you have no basis upon which to claim that AI learning is NOT similar to how humans do it, or that human intelligence "goes way beyond that"; it's just that humans have a head start and a lot more data to work with.


“Humans are trained on orders of magnitude more multimodal data over their lifetimes. Also, humans are not born as an unbiased model; billions of years of evolution have crafted many implicit biases into our cognition (like a propensity for language, facial recognition, etc.).”

What in the world are you talking about? I must be talking to chatgpt and am done with this thread. We were originally discussing the differences in methodology between AI and humans for passing standardized exams. Those involve tasks like applying well-defined mathematical concepts to a brand new problem, not “multimodal” data or facial recognition.


If you don't understand what I'm talking about, then you don't understand how these transformer AIs learn and solve problems, so maybe you shouldn't opine about how AI couldn't pass these tests. Your claimed differences between how humans and AIs work are conjecture that can be explained by what I described rather than by fundamental differences in how these systems work.


Here's a simple example: (x - x + c) = c. GPT struggles with such examples if `x` is a large number, e.g., x = 123_000_000_456 and `c` is some specific number, but it is easy for humans.
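
To be clear, the identity itself is trivial to check exactly in code (the value of c here is an arbitrary choice):

    x = 123_000_000_456
    c = 7
    print(x - x + c)  # 7, no matter how large x gets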


It's easy for humans once they're taught what variables mean and after 10 or so years of exposure to a real-world multimodal training set that's orders of magnitude more data than GPT has seen. Also, algebra is not so easy for people with IQs lower than 90, so not exactly all humans, right? What exactly am I supposed to infer about how GPT or other AIs and human brains operate from this apples-to-cars comparison?

You don't have to point out failure modes of GPT, I know what they are. The question we're discussing here is what this indicates, if anything, about how these systems operate as compared to human brains, and whether the differences come down to training data or the fundamental architecture.


ChatGPT is not the only AI in the world.


> This is the same logic why people can train to improve their SAT scores

So isn't this literally moving the goalpost? "So what an AI can beat the SAT, so can humans"


Well, my experience was that ChatGPT produced utter nonsense including literally "1/19+1/19=2/19=1" when asked a simple probability question.


Transcript of a conversation I just had with ChatGPT:

Prompt:

I'm talking with a user named "331c8c71" on Hacker News. According to statistics I found by Googling, the number of Hacker News readers living in the USA is 36%. Assuming this is correct, what is the probability that 331c8c71 does not reside in the USA? Please explain your reasoning.

Response:

If we assume that the statistic you found is accurate, then we know that 36% of Hacker News readers live in the USA. Therefore, the probability that any given Hacker News reader does not live in the USA is 100% - 36% = 64%.

However, we do not know if 331c8c71 is a typical Hacker News reader. They could be part of a demographic that is more or less likely to live in the USA. Without more information, we cannot say for certain what the probability is that 331c8c71 does not reside in the USA.

So, the answer is 64% assuming that 331c8c71 is a typical Hacker News reader, but it may not be an accurate estimate depending on additional information about the user.

---

Like I said, these models are far from 100% accurate, and there is lots they get wrong, but they clearly are capable of some kind of reasoning that goes beyond simple text substitution of training data.


The fact is that just a few years ago, even parsing the question would have been an achievement. I feel like people are just glossing over how incredible this answer is.


I feel like there is some contingent of people who are really bent on downplaying the achievements of AI as of late. It's objectively insane, yet somehow every discussion is still sprinkled with some form of "It told me 8x8=60 so I closed the window and never used it again".


I am impressed with LLMs but I think their inability to produce an honest "I don't know" instead of hallucinating is an issue.


True, but the problem is that they never know. They don't interact with the real world and have no way of verifying their training data's accuracy. Perhaps they could assign a confidence level to their response? But then, if they assign a response a high confidence level and produce an incorrect answer, it will compound their failure. Not only were they wrong, but they vouched for their wrong answer.


Google search was incredibly valuable immediately, even if most links could have been rubbish. I can't say the same about the current LLMs.

It is an incredible achievement that LLMs produce human-like output (e.g., I wouldn't know whether a GPT bot was answering me unless we were discussing a topic where precision/accuracy are important), but they hallucinate (they are confident BS generators).

The hype is that LLMs can solve any problem and replace humans (jobs). It is not so.

It may depend on what you do, but I find it is easier/faster to do the work myself than to spot and fix [a possibly subtle] error in AI output. Though some of the specific things will improve in time, and you can find tasks where AI is useful even today.

I don't see how the models can improve for general tasks (AGI) without being an existential threat to humans (not just jobs).


Also, it doesn't seem like such a leap to couple these language models with dedicated computation systems and the like. Think training a model to feed prompts to Wolfram Alpha to actually compute the results, then report back.
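
A hedged sketch of that coupling, with SymPy standing in for Wolfram Alpha; the ask_llm stub and the "SOLVE:" convention are made up for illustration:

    import re
    import sympy as sp

    def ask_llm(prompt: str) -> str:
        # Hypothetical model call; imagine it replies "SOLVE: x**2 - 4"
        # whenever it decides the question needs exact computation.
        raise NotImplementedError

    def answer(question: str) -> str:
        reply = ask_llm(question)
        match = re.match(r"SOLVE:\s*(.+)", reply)
        if match:
            x = sp.symbols("x")
            roots = sp.solve(sp.sympify(match.group(1)), x)  # exact symbolic result
            return ask_llm(f"{question}\nSolver result: {roots}\nFinal answer:")
        return reply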


"reasoning" is a strong word. "Pattern extraction and application" is a better description of what is happening.

In particular, LLMs fail miserably at tasks like "apply this simple pattern many times in succession", aka a for-loop, because they can't count in an abstract way, only in concrete contexts.


IQ tests, the SAT, and exact mathematics are very different fields though. ChatGPT etc. are very good at "emulating" logic by sheer force and size; exact math doesn't work with "emulation".


> I mean, how much further are we moving the goal post?

Look at it this way: humans don’t have BPE-encoded text as input to their brain. It is ALL visual input. For AGI, you would at least need to add audio input as well. And be driven by action and reward.

The learning capabilities of the brain are still beyond the processing capabilities of current architectures. Just the notion of a model receiving only pixel data that contains a question and being able to output voice data that produces a correct answer, using no partial model trained on another corpus, is probably not tractable without significant improvements.

But the models can be very useful without being AGI!


AGI is closer to tokenization than you might think. I realized this recently when trying to do audio prediction.

There was recently a project called riffusion which generates spectrograms, then recovers audio from the spectrograms.

You might be tempted to apply this to predict speech. But speech isn't like music. We're communicating in language, using a sequence of tones. It's why most speech codecs use linear predictive coding. Predicting the waveforms won't get you anywhere; it yields no semantic understanding of language.

So the next step up is to divide speech into a series of tones, and try to predict those sounds rather than raw waveforms.

Except… that’s literally tokenization. And there’s some evidence that this is precisely what our brains are doing.
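
A toy illustration of the idea: slice a waveform into frames, take magnitude spectra, and vector-quantize the frames into a small discrete vocabulary. Real systems (HuBERT-style units, etc.) are far more sophisticated; this just shows what "audio as tokens" means:

    import numpy as np
    from sklearn.cluster import KMeans

    def tokenize_audio(waveform, frame_len=400, hop=160, vocab_size=64):
        frames = [waveform[i:i + frame_len]
                  for i in range(0, len(waveform) - frame_len, hop)]
        spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1))  # per-frame features
        codebook = KMeans(n_clusters=vocab_size, n_init=10).fit(spectra)
        return codebook.predict(spectra)  # a sequence of discrete token ids

    tokens = tokenize_audio(np.random.randn(16000))  # 1 second of fake 16 kHz audio
    print(tokens[:20])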


There is definitely something symbol-adjacent that needs to happen inside of the model; this is what I assume happens in the brain. But it is not purely symbolic.

For instance, consider voicing the end of a letter: “I will definitely not be stabbed in the bac…” (where the word "back" quickly devolves into a line that crosses through the rest of the letter). It goes from symbolic to contextual, implying that the author was stabbed midway through writing it, so the voicing must end with a yell of playful agony.

The same goes for calligraphic art, such as the Al Jazeera logo, for instance, which is intended to be understood both as a sequence of Arabic letters and as a depiction of a fire. A model seeing this image for the first time needs to see it both ways at the same time.

But it's true that we can't just throw a transformer at the problem, train it from scratch with video inputs and audio outputs, coupled with a sporadic reward, and suddenly have it be able to solve scans of civil engineering exams. The brain can do it, but not silicon (yet). It is easier to take models that were trained with simpler losses (tokenized cross-entropy) on simpler problems (next-token prediction) and combine them. Not true AGI learning, but eventually it will fool people into believing it is.


Actually there's a whole new subfield called textless NLP doing just that: Learning language models from raw audio. https://ai.facebook.com/blog/textless-nlp-generating-express...


> It is ALL visual input.

And sound, taste, smell, touch.


don't forget humor


and doom, awe, timing, a word, absurd, in, horse, common, wonder, beauty, honor, duty, doubt, danger, fear, style, worth, direction, dread, non.

https://m.youtube.com/watch?v=SKnnF5qA31I


TIL horse sense. I was thinking what on earth is the sense of horse?


phlegmatic, choleric, sanguine or melancholic?


It's crap at the visual Raven IQ test though; it scores 22%, vs. 17% for an algorithm that takes random guesses.


I'd be cautious with such general statements given the rapid pace of development in this area.

Benchmark shelf lives aren't that long.

You omitted the fact that tuning bumped it to 26% vs. the random baseline's 17%.

Sure, questionable what effort is involved in that step, but at the same time, that hints to me that tuning will be the new baseline within the next 12-24 months.


Sure, I would expect it to improve. But it was a bit fishy how "it took an IQ test!" is in all the highlights, and then they mumble quietly about the score it actually got and hope no one is listening to that bit.

It's notable that it was able to attempt it at all, I suppose.


Semi related, there's a (pretty good) course at OMSCS where the main project is building an agent to solve RPM problems: https://lucylabs.gatech.edu/kbai/spring-2023/project-overvie...

And quite a lot of papers about that: https://scholar.google.com/scholar?q=%22raven%27s+progressiv...


Bet you 5 bucks I can train one that gets 100%. Just gotta train it on the ravens answer key.


It's pretty big by any standards, but you may find the work of Gradshteyn and Ryzhik solves this problem nicely.


>how much further are we moving the goal post?

My goalpost for AGI is when Microsoft can fire their entire engineering staff, replace them with AI, and not notice any decrease in productivity or quality of output.

This test is empirically verifiable (in principle). No need to argue over whether the AI scoring X% on Y assessment task is “truly” impressive or not.


You mean that isn’t the Teams origin story?


Teams is the projection of a fourth-dimensional ancient Eldritch horror onto the mortal plane. AI had nothing to do with it.


You’re confusing the goal - the goal here isn’t about the finish line but the point where people all concede that the finish line is actually reachable without any major, presently unthinkable advances.


Surely anyone familiar with software engineering knows that the finish line is reached when the finish line is reached. And no sooner.

Physicists at the beginning of the 20th century also thought that the finish line of physics was in sight and all that remained was tightening a few constants. Look how that turned out.


This perspective is too reductionist - we make predictions of success all the time based on first principles reasoning. It’s perfectly sane to try to predict, and make good arguments, if AGI is possible to achieve without new breakthroughs.


It wasn't very good at the IQ test. But yes, it is promising.

"Although there is still a large performance gap between the current model and the average level of adults, KOSMOS-1 demonstrates the potential of MLLMs to perform zero-shot nonverbal reasoning by aligning perception with language models."


ChatGPT does a great job on symbolic manipulation. You have to prompt it to show derivations, however, rather than just discussing the topic at a high level.


Yeah, that could be good. I think LLMs will start to be really useful when they start to do math at human level. When this happens, the sky is the limit.


What is human level for math? Terence Tao? Average American?


Definitely not an average American. Someone who learned to do math (proofs, application) and got fairly good at it.


There's no goalpost to move. Psychologists have been saying that IQ tests are of limited scope and utility for decades. Specifically, it is widely agreed that an IQ test is not a valid way "to assess intelligence in a broader sense".

https://en.wikipedia.org/wiki/Intelligence_quotient#Validity...


I prefer not focusing on games and benchmarks. Hopefully we'll get to robotics soon.


Arnold was great in that documentary!


Roger roger.


What goalpost specifically?


Solving IQ tests which measure quantitative reasoning.


There was no movement of this goalpost.


Writing down the equations is the task for the AI.


Yeah, there is a model that can do differential equations.



I like this feature they are working on

https://arxiv.org/abs/2212.10554

as I'd say the most obvious limitation of today's transformers is the limited attention window. If you want ChatGPT to do a good job of summarizing a topic based on the literature, the obvious thing is to feed a bunch of articles into it and ask it to summarize (how can you cite a paper you didn't read?), and that requires looking at maybe 400,000 - 4,000,000 tokens.

Similarly, there is a place for a word embedding, a sentence embedding, a paragraph embedding, a chapter embedding, a book embedding, etc., but these have to be scalable. Obviously the book embedding is bigger, but I ought to be able to turn a query into a sentence embedding and somehow match it against larger document embeddings.
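
Something like the sketch below, where coarser embeddings are just pooled from finer ones so a sentence-scale query can still be scored against paragraphs, chapters, or whole books. embed_sentence is a hypothetical stand-in for whatever encoder you like, and mean-pooling is only one simple choice:

    import numpy as np

    def embed_sentence(text: str) -> np.ndarray:
        raise NotImplementedError  # hypothetical sentence encoder

    def pool(vectors):
        # paragraph/chapter/book embedding = mean of its parts
        return np.mean(np.stack(vectors), axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rank_documents(query, documents):
        """documents: list of lists of sentences; returns indices, best match first."""
        q = embed_sentence(query)
        doc_vecs = [pool([embed_sentence(s) for s in doc]) for doc in documents]
        return sorted(range(len(documents)), key=lambda i: cosine(q, doc_vecs[i]), reverse=True)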


That seems an infinite road. Humans don't need to memorize every token in every context in order to learn. They extract patterns online as they go.


feed a bunch of articles into it and ask it to summarize

A better way (that's how humans do it) is to first summarize each article, then feed the summaries to get an overview of the topic. This way there's no need to expand the attention window.
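
In code that is just a map-reduce over the articles; summarize here is a hypothetical stand-in for any model call whose input fits the window:

    def summarize(text: str) -> str:
        raise NotImplementedError  # hypothetical LLM call with a fixed context window

    def summarize_topic(articles: list[str]) -> str:
        per_article = [summarize(a) for a in articles]   # each article fits on its own
        return summarize("\n\n".join(per_article))       # then summarize the summaries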


I've thought about that one for a long time. A long time ago I was reading the proceedings of TREC, trying to understand why Google was so much better than the search engines I knew how to build. TREC is pretty depressing because you find that 95% of the things you might think would improve search rankings do not. Particularly before BM25 was developed, people tried indexing sub-documents and consolidating them, and consistently struck out.

Since BERT came out there is a considerable literature of people struggling mightily to combine transformer representations of document parts into a whole, which convinces me that one could spend a few lifetimes pushing a bubble around underneath that rug.

I think the best argument for your case is that people seem to get along just fine with a limited short-term memory. I'd temper that with the observation that a person writing a summary is actually doing a multi-stage process in which their short-term memory is attending to part of what they are writing and part of what they are reading, and they are building long-term memory structures at the same time. So there is a lot going on.

In the sense that abstracts work well for information retrieval, and that many of them would fit in the GPT attention window or only be a little bigger, you could make the case that a fixed-size structure could be highly useful for IR.

On the other hand, many documents, such as scientific papers, are considerably bigger than the current attention window and direct summarization of a single document via transformer will still need a bigger window, more like 40,000 tokens.

A lot of things in the literature are complex, muddy, contradictory or all of the above. (Try a question like "What did Freud think about narcissism?" or "What is the clinical relevance of Bleuler's concept of ambivalence?" or "Tell me about cosmic inflation" or "What is the dark matter particle?")

Hard cases really do require matching up parts of document A with parts of document B and certainly having them in the same attention window would help an LLM do that in a natural way.

It might be completely impractical, not just because of computational scalability but possibly more fundamental scalability limits. (I'm not sure a person with a 10x bigger short term memory would really be able to solve problems better than the average person... There are transformers with a 500,000 token attention window today and they suck.)

There could be some procedure where you cut documents up into pieces in various ways, extract critical context from documents A and B and other literature, and also put in the parts you want to critique against each other, or even match up different parts of the same document to do the same. Maybe a small attention window could still be used to decompose documents into knowledge graphs, but it is by no means trivial to reason over a KG once you have it.

What I do know today is that I have documents >4096 tokens that I want to retrieve, cluster and classify right now and transformers were not up to the task in Feb 2023, and I am hoping for some progress soon that will help.


One of their other recent papers was work towards expanding this limit in language models, "Structured Prompting: Scaling In-Context Learning to 1,000 Examples"

https://arxiv.org/abs/2212.06713


transformers were not up to the task in Feb 2023, and I am hoping for some progress soon that will help.

I think that's coming, OpenAI is talking about some new "DV" model with up to 32k context window: https://twitter.com/transitive_bs/status/1628118163874516992

Hard cases really do require matching up parts of document A with parts of document B

The hard part here is not processing documents, it's determining which documents need to be "matched". This requires having some sort of a "knowledge map", a semantic search space of "knowledge patterns", or maybe even a traditional search engine, so that given a document A a model can find relevant documents - in its long term memory, or in a dataset, or even on the internet. Once the documents are found, you don't really need to load the whole thing into the attention window. When I read a long paper, I do it section by section - I just need to maintain a high level map of the paper in my head. I process one "knowledge pattern" at a time, and every time I do a lookup or a search for relevant patterns. I shouldn't be limiting that search to only what's in my current attention window, even if the window is a million tokens. But yes, the window should be big enough to hold at least two of such patterns (which map to chunks of text, or images/audio/etc) - the one I'm currently processing, and one that is most similar to it in the knowledge space.


I don’t trust any report of model performance from papers, unless there is a publicly accessible demo. It is way too easy to test things the model has trained on and for the model to then completely fall flat when used by people in the real world.


The FB Galactica model is a good example of this. It sounded really promising: impressive paper, lots of weights. But when you actually tried it, it mostly produced garbage.


Another one that looks even more compelling:

Multimodal Chain-of-Thought Reasoning in Language Models, https://arxiv.org/abs/2302.00923

By building in chain of thought and multimodal learning, this 1B parameter model beats GPT-3.5's 170B parameter model.


It's worth noting that this is a comparatively small model (1.6B params from memory).

It'll be interesting to see what capabilities emerge as they grow the model capacity.


That's a good point. There's a paper that talks about the non-linear nature of these models: at some very large size they seem to show a leap in ability.


Hmm... LLMs / MLLMs might truly be a unified input/output interface of a would-be AGI, I think.


Yeah, check out the Lex Fridman podcast episode #333 around minute 52, where Andrej Karpathy talks about the OpenAI project "World of Bits" that did this.

https://youtu.be/cdiD-9MMpb0?t=3013



At Microsoft:

"Hey, why don't we call our new LLM Cosmos?"
"That's taken by the Azure Cosmos DB guys."
"Damn it... how about Kosmos-1?"


"Fun" fact: There is another common internal service at Microsoft called Cosmos, and it is also a database.

So now there is Cosmos, Cosmos DB, and Kosmos.


Isn't there also a batch processing system named... you guessed it... Cosmos?



Did anyone else initially read that as `Kosmos~1`, and wonder what the full name of the project was?


Anyone know if this will be an openly available model?



