
> after an observation you are absolutely at a new state

The essential property of a Markov chain is maintaining the Markov property:

P(X_n+1 = x_n+1 | X_n = x_n, ..., X_1 = x_1) = P(X_n+1 = x_n+1 | X_n = x_n)

That is, the future, given the present state, is conditionally independent of the past states.
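
As a toy illustration (a made-up two-state weather chain, Python just for concreteness), the transition function consults only the current state:

  import random

  # Hypothetical two-state chain: the next state depends only on the
  # current one, which is exactly the Markov property.
  TRANSITIONS = {
      "sunny": {"sunny": 0.9, "rainy": 0.1},
      "rainy": {"sunny": 0.5, "rainy": 0.5},
  }

  def step(state):
      nxt, probs = zip(*TRANSITIONS[state].items())
      return random.choices(nxt, weights=probs)[0]

  state = "sunny"
  for _ in range(5):
      state = step(state)  # no history is consulted anywhere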

It's worth recognizing that decoder-only LLMs (which include most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.

> The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.

Attention has nothing to do with maintaining the Markov property; it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.

tl;dr most of the LLMs people use are effectively Markov Chains.



What is the conceptual difference between a "past state" and a "present state"? In a typical conversation with an LLM, you have past messages that remain in the KV cache when generating the next token (so as not to recalculate everything). Yet here, everything that came before is reinterpreted as part of the present state. In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain? I can't wrap my head around it. Can you give an example of a non-Markov system?


The difference is arbitrary but fixed. For any given Markov model, there are inputs that can fully fit in the present state and inputs that can't. But for any given input, there are models that can encode it entirely in the present state.

Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window. But not for the limitations of such predictors in general, as there is always a Markov model large enough to handle any specific input.


> In other words, it seems as if you can take any past state and fold it into one large concatenated present state

That's what n-grams are: even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3- to 7-grams.
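
A minimal word-level 3-gram sketch (toy corpus, Python for concreteness), where the state is the last two words rather than just one:

  import random

  # Toy 3-gram Markov chain: the state is the last *two* words.
  corpus = "the cow jumped over the moon and the cow slept".split()

  model = {}
  for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
      model.setdefault((a, b), []).append(c)

  state = ("the", "cow")
  out = list(state)
  for _ in range(6):
      options = model.get(state)
      if not options:              # dead end: unseen context
          break
      nxt = random.choice(options)
      out.append(nxt)
      state = (state[1], nxt)      # slide the window; still Markov in this state
  print(" ".join(out))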

> Can you give an example of a non-Markov system?

Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.


If you include the encoder outputs as part of the state, then encoder-decoder LLMs are Markovian as well. While in token space, decoder-only LLMs are not Markovian. Anything can be a Markov process depending what state you include. Humans, or even the universe itself are Markovian. I don't see what insight about LLMs you and other commenters are gesturing at.


A common distinction is that whatever you take to be your state space should be fixed-size. So an LLM has a fixed context window, which might be extremely large, but with a long enough conversation, the oldest tokens will fall out of the context window, and your process will be non-Markovian. But with a large context size, most conversations will be small enough that they fit entirely within the context, and then the process is Markovian.

So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.

I think of it as conceptually a little bit like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist some computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.


Sorry, I realized I didn't quite write what I meant to. I didn't intend to say that LLMs are non-Markovian from a theoretical standpoint. I meant to say that the language generation task is non-Markovian from a theoretical standpoint, because the next word can depend on arbitrarily distant history.


it can depend on whatever the oldest thing in the context window is, but not anything older


> In other words, it seems as if you can take any past state and fold it into one large concatenated present state - and, by that logic, call basically anything a Markov chain?

Essentially yes. Given complex enough state, more or less any process is Markovian. Some do require infinite state, which would maybe be stretching the definition a bit. In practice, a Markov formulation may not be a very good analysis perspective if the required state would be very large or complex.


Hmm... For example, say you have a machine that must decide whether a monetary transaction is allowable or not. The state at any given time could be the balances of all accounts, plus some metadata about past states (but not the past states themselves). Then if A has $5 in their account and tries to transfer $5 to B, the contents of the metadata would be the deciding factor of whether the transaction goes through or not. The machine still only operates on the present state, but the present state depends indirectly on past states.
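
A minimal sketch of that machine (field names and the flagging rule are hypothetical):

  # The present state is (balances, metadata); metadata is a digest of
  # history, not the history itself.
  state = {
      "balances": {"A": 5, "B": 0},
      "metadata": {"A_flagged": True},  # set by some earlier transaction
  }

  def allow(state, sender, amount):
      # Reads only the current state -- but what the metadata encodes
      # about the past is the deciding factor.
      if state["metadata"].get(sender + "_flagged"):
          return False
      return state["balances"][sender] >= amount

  print(allow(state, "A", 5))  # False: blocked because of past behaviour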


But a conversation with an LLM doesn't only depend on the last message, it depends on all previous messages. Sure, the number of messages is limited by the context window, so you can indeed say it's a high-order Markov chain (it can only see N states back), but theoretically, what would an actual non-Markovian LLM look like? Something with an infinite context? That's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.


You appear to be confusing the UI over the LLM with the actual working of the LLM. None of them allow for more than one message to exist. You put in one message and you get one word (well, one token) back.


I have an idea of how they work. At the end of the day, it's all just numbers. You say you put in "one message", but in reality it's N input numbers corresponding to the tokens (where N can vary from 1 to any size). Basically N variables. An LLM is one large function which takes N values as input and produces a new output value. So you can view an LLM as y = f(a, b, c, ...)

What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?

As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.


Let's keep it simple and say that the procedure of an LLM is

  state: list[n];               // fixed-size context window
  while (true) {
      new_token = llm(state);   // llm() is blind to everything but state
      state.pop_front();        // the oldest token falls out
      state.push_back(new_token);
  }
Any given time that llm() is called, it is blind to previous invocations. The state may contain evidence of them, or it may not. It's a free-form data structure that is sometimes filled with user input and sometimes contains previous llm() outputs, but llm() doesn't care.


So any pure function is Markovian then? They too only depend on their input and are blind to previous invocations. I’m just trying to understand the practical purpose of calling something a Markov chain beyond small sizes, because if we allow the state to be of any size, I’m not sure what the usefulness is: anything can be called a Markov chain if you concatenate all input data of any size into one blob and interpret it as "one state."


>So any pure function is Markovian then?

I guess it would be more accurate to say that at its core a Markovian process is based around a pure function. Usually there's some kind of feedback loop around the function that allows the process to evolve in some way and continue producing output.

>I’m pretty sure anything can be modeled using pure functions.

We're not talking about theoretical models of black boxes, though, we're talking about known mechanisms. A recursive function doesn't become iterative just because it could be thought of as iterative (because eventually the CPU does loop over some sequence of instructions). We still say it's recursive.

>I’m not sure what the usefulness is

I suppose part of the usefulness is to de-mystify LLMs. If you can understand what Markov chains do, the thought "oh, an LLM is just a much more complex Markov chain" can help in reasoning about them.


“just”?

If you include the entire past history as part of the state, then any stochastic process becomes a Markov process.

Any computer program you could run (without network access or other inputs for a time; assume your computer has a true RNG module) is a Markov process. Just say that your state space is the space of possible configurations of the computer's memory.

Saying that the LLM is “just” a Markov process seems to be doing something a bit weird with that “just”.


>Any computer program (without network stuff or inputs for a time (assume your computer has a true RNG module)) you could run is a Markov process. Just say that your state space is the space of possible ways the memory of the computer can be.

No. The entire computer is a Markov process in that situation. An individual process may still remember its past states and therefore not be Markovian. See the example I gave about allowing or blocking transactions.


By “memory” I was including whatever is on disk, and the values of all the CPU registers, and all the CPU flags, etc.

So, yeah, the state of the computer (regarded as an abstract machine, not viewed at the level of the physical hardware).

Oh, you mean because I said a particular program, and other things on the computer could interfere with the program. Ok, point. I was imagining a computer that just had a single program running on it (with no separate OS) I guess? But yeah, good point, I certainly didn’t make my idea precise enough (in my own mind or in my words).


Yes, I understood what you meant and my answer assumes that the contents of all memory -- volatile and persistent -- constitute the state of the machine.

It doesn't matter whether the program is the only thing the computer is doing or whether there are other things running on the same machine. The program, considered as a machine unto itself, is a conceptually different thing from the underlying hardware. The Markov property is a way to reason about how a process interacts with its state. "I could design a Markov process that's semantically equivalent to a non-Markov process, by just including more state" is a pointless statement. The whole point of the distinction is to reason about the thing you're studying.


>what an actual non-Markovian LLM would look like? Something which has an infinite context? It's not physically possible... At this point it sounds like "it's a Markov chain" is not a very useful statement.

Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.


The point being made is equivalent to pointing out that, by strict mathematical definition, there aren't physically realizable, distinguishably non-Markovian processes, because computers are finite objects. Yes, you can talk about Markov chains as referring to everything up to and including actual human brains, but the kinds of systems people actually refer to as Markovian in practice are much more specific than that: typically systems with structurally simple, indexical nodes, sometimes extending to more complex ideas like RNNs, where the states are at least of finite bandwidth. An LLM not only has continuous-valued nodes, it has nodes whose description length grows proportionally to, and theoretically unboundedly with respect to, the size of its history.


>because there are processes that are neither LLMs nor Markov chains.

I'm not sure what those would be. All dynamical systems (essentially all of classical physics), except maybe quantum mechanics, are Markov chains.


Well, I gave an example already, so...


Could you help me understand how decoder-only LLMs maintain the Markov property? If you used the same random seed, the input to the model "The cow jumped over the" would not give the same output as just "the", right? So isn't that violating the Markov property?


State (in this sense at least) isn't word/token parsing progress; it comprises all the input and any context (which may include the entire chat history, for example).


There would need to be a state specifically for “the cow jumped over the” (and any other relevant context) and states for all the other times ‘the’ is preceded by something.

This is the limitation I was getting at, btw. In the example I gave, if you have an image with solid vertical columns, followed by columns of random static, followed again by solid vertical colors, a Markov chain could eventually learn all patterns that go

solid->32 random bits->different solid color

And eventually it would start predicting the different color correctly based on the solid color before the randomness. It ‘just’ needs a state for every possible random color between. This is ridiculous in practice, however, since you’d need to learn 2^32 states just for the relationship between those two solid colors alone.
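
Back-of-envelope, that blow-up looks like:

  # One state per possible 32-bit static pattern, per pair of solid colors.
  states_per_color_pair = 2 ** 32
  print(states_per_color_pair)  # 4294967296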


> It ‘just’ needs a state for every possible random color between.

You can use skipgrams - prefixes with holes in them.
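
Roughly (a hypothetical sketch, not the paper's exact formulation), a skipgram context wildcards the noisy positions:

  # A 3-token context with a hole at position -2: the static in the
  # middle no longer multiplies the number of contexts to learn.
  history = ["solid", "static-bits", "solid"]

  def skipgram_context(history):
      return (history[-3], None, history[-1])  # None marks the hole

  print(skipgram_context(history))  # ('solid', None, 'solid')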

Sparse Non-negative Matrix Language Model [1] uses them with great success.

[1] https://aclanthology.org/Q16-1024/

Pure n-gram language models would have a hard time computing escape weights for such contexts, but the mixture of probabilities used in SNMLM does not need to do that.

If I may, I've implemented an online per-byte version of SNMLM [2], which allows the use of skipgrams. They make performance worse, but they can be used. My implementation's predictive performance is within a few percent of an LSTM's on enwik8.

[2] https://github.com/thesz/snmlm-per-byte


They are Markov chains; anyone saying otherwise doesn't understand what a Markov chain is.


Wouldn't tool calling or MCP break the finite state?


Tool calling is just part of the protocol you're using for how to handle states output by your Markov model. Same goes for chat input, for that matter.

Chat interfaces are like taking a simple Markov generator, but with a rule where you say 'whenever it reaches a state ending in X, hand over decision making about state transitions to a human instead of a random number generator, until they move it to state ending in Y'.

Tool calling is similar - 'when it reaches state ending in X, send the state data to a tool; use the result to drive a series of state transitions; then start generating again'.
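
A sketch of that protocol (llm() and run_tool() are hypothetical stand-ins for the real components):

  import random

  TOOL_CALL, END = "<tool>", "<end>"

  def llm(state):
      # stand-in sampler; a real model conditions on the whole state
      return random.choice(["hello", TOOL_CALL, END])

  def run_tool(state):
      # stand-in tool: reads the state, returns tokens to splice back in
      return ["tool-output"]

  state = ["user-prompt"]
  while True:
      token = llm(state)   # the transition depends only on the current state
      if token == END:
          break
      state.append(token)
      if token == TOOL_CALL:
          # hand the transitions over to the tool, then resume sampling
          state.extend(run_tool(state))
  print(state)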


Memory Augmented Large Language Models are Computationally Universal, https://arxiv.org/abs/2301.04589


Authority arguments like "I’m right and if you disagree with me you don’t understand anything" are not exactly helpful.


It's not an authority argument; it's the opposite: a warning about the supposed authority of people who would make that kind of claim.

The validation of what I'm saying is little more than a few minutes of googling away.

Some interventions can just be warnings, rather than lectures.


Exactly right on the mathematical equivalence! Your clarification of the Markov property is crucial - the conditional independence given the present state is what defines it, regardless of how complex that state representation becomes.

The key insight you raise about attention is particularly important: it doesn't violate the Markov property, it just enables a vastly more sophisticated state representation. Classical n-gram Markov chains use simple discrete states, while transformers use high-dimensional continuous representations that can encode exponentially more information about context.

This perspective helps bridge the conceptual gap many people have. When they think "Markov chain," they often picture simple state transitions, but mathematically, LLMs are just Markov chains with astronomically complex states. The attention mechanism is the computational trick that makes these complex states tractable - it's not changing the fundamental probabilistic structure.


Is this just because LLMs don't have state?

As far as I understand it, as you have a back-and-forth conversation with an LLM, you have to provide the entire history of the conversation plus your new response each time.


Stateful models, e.g. RNNs, are Markov models too. Sometimes "Markov chain" is used to refer specifically to models with no hidden state, e.g. (decoder-only) Transformers.
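
A sketch of the difference (weights are random stand-ins): an RNN's Markov state includes a hidden vector that never appears in the output.

  import numpy as np

  rng = np.random.default_rng(0)
  W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

  def rnn_step(hidden, x):
      # the next hidden state depends only on (current hidden, current input)
      return np.tanh(W_h @ hidden + W_x @ x)

  hidden = np.zeros(4)
  for x in rng.normal(size=(3, 4)):  # a short input sequence
      hidden = rnn_step(hidden, x)   # history lives on only through `hidden`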


> tl;dr most of the LLMs people use are effectively Markov Chains.

...when they are used without RAG and tools. x_n belongs to a set of cardinality 65536^100000 or so. Equating an LLM to a Markov chain doesn't allow you to make any non-trivial predictions about its behavior.



