
It’s cool that progress is being made on alternative LLM architectures, and I did upvote the link.

However, I found this article somewhat frustrating. Showing the quality of the model is only half of the story, but the article suddenly ends there. If people are going to be motivated to adopt an entirely different architecture, then performance and context size deserve at least as much discussion.

Given the linear scaling being shown, it seems like the primary thing people are going to want to see is the next frontier of LLMs: context sizes of ~1M tokens. The word “context” does not even appear in this article, which is disappointing. If there were a discussion of context, it would be nice to see whether the model passes the passkey test.

The article also appears to reuse a chart from RWKV-4 showing how awesome a linear function is compared to a quadratic one, but… cool story? It’s not even clear what this chart is truly showing. Is this chart only showing generated tokens, or is this including prompt tokens? As I have never used RWKV, I have no idea how the prompt processing speed compares to the token generation speed. Prompt processing speed has been a big problem for Mixtral, for example.

As a reader, I want to see a couple of actual examples of X prompt tokens + Y generated tokens, and the tokens/s for X and Y for RWKV-5 and for Mistral on the same hardware. On the Mistral side, it is trivial to collect this information in llama.cpp, but I don’t know how the tooling is for RWKV.
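
For concreteness, this is roughly how I’d collect those numbers against a llama.cpp-style HTTP server (a minimal sketch; the endpoint and payload fields are written from memory, so treat them as approximate):

    import time, requests

    URL = "http://localhost:8080/completion"   # assumed llama.cpp-server-style endpoint
    prompt = "lorem ipsum " * 1000              # a few thousand prompt tokens
    N_GEN = 256                                 # generated tokens to request

    t0 = time.time()
    # prefill-only: process the prompt but generate (almost) nothing
    requests.post(URL, json={"prompt": prompt, "n_predict": 1})
    prefill_s = time.time() - t0

    t0 = time.time()
    requests.post(URL, json={"prompt": prompt, "n_predict": N_GEN})
    total_s = time.time() - t0

    # very rough: assumes no prompt caching between the two requests
    gen_tps = N_GEN / max(total_s - prefill_s, 1e-9)
    print(f"prompt processing: ~{prefill_s:.1f}s, generation: ~{gen_tps:.1f} tok/s")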


These models don't have a fixed context size and are progressively fine-tuned for longer and longer contexts. The context length also doesn't impact inference cost.

Another aspect of performance is not just how well the trained model performs, but how data efficient it is (performance per token trained). The comparison with Pythia (an open GPT) is shown in the article.

The RWKV-4 paper is quite detailed and has examples of prompts and responses on the last few pages:

https://arxiv.org/abs/2305.13048

And IIRC RWKV-5 is very similar to RetNet, which is detailed here:

https://arxiv.org/abs/2307.08621

Edit: now that I've thought about it more, the data efficiency seems like a highly important aspect given their noble goal of being fully multilingual. This is fairly interesting theoretically, as well as for other applications where an abundance of data is not a given.


For linear transformers, the current metric is "perfect token recall", the ability of the model to recall a randomized sequence of data. You can find the limit of a particular model architecture by training a model of a particular size to echo randomized data, and I believe this was touched on in the Zoology paper.

This doesn't prevent the model from retaining sequences or information beyond this metric, as information can easily be compressed in the state, but anything within that window can be perfectly recalled by the model.

Internal testing has placed the value for Eagle around the 2.5k PTR [perfect token recall] mark, while community fine-tunes done on the partial checkpoints for long-distance information gathering and memorization have been shown to easily dwarf that.
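
A minimal sketch of what such a probe looks like (my own illustration of the idea, not the exact methodology used for Eagle; model.generate is a hypothetical call):

    import random

    def make_recall_sample(vocab_size=50_000, length=2_500):
        """Echo task: a random token sequence the model must repeat verbatim."""
        tokens = [random.randrange(vocab_size) for _ in range(length)]
        return tokens, list(tokens)   # input sequence, expected output

    def perfect_recall(predicted, expected):
        """True only if every token matches exactly -- no partial credit."""
        return list(predicted) == list(expected)

    # sweep lengths to find where a given architecture/state size stops recalling
    for length in (512, 1024, 2048, 2500, 4096):
        inp, expected = make_recall_sample(length=length)
        # predicted = model.generate(inp)          # hypothetical model call
        # print(length, perfect_recall(predicted, expected))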

Prompt processing speed benefits from the same GEMM optimizations as standard transformers, with the added benefit that those optimizations work for batch inference as well (no need for vLLM, as memory allocation is static per agent).


RWKV does not have a context size; or, to look at it another way, it has an infinite one.

As far as I understand it, there is an internal state that holds new information while reading input; later information can overwrite earlier information, which is arguably human-like behaviour.
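
Very roughly, the mechanism is a fixed-size state updated once per token, with decay deciding what gets overwritten (a toy sketch of the idea, not the actual RWKV update rule):

    import numpy as np

    D = 64                       # state size is fixed, independent of input length
    decay = 0.95                 # learned per-channel in the real model; constant here
    state = np.zeros(D)

    def step(state, token_embedding):
        # old information fades, new information is written in; nothing ever grows
        return decay * state + (1 - decay) * token_embedding

    for tok in np.random.randn(100_000, D):   # 100k tokens, constant memory use
        state = step(state, tok)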


One of three things has to be true:

a) this is false

b) perfect recall is false (ie. as the internal state is overwritten, you lose information about previous entries in the context)

c) the inference time scales by the context length.

It’s not possible to have perfect recall over an arbitrary length in fixed time.
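
A back-of-the-envelope way to see it: a fixed-size state can only hold a bounded number of bits, so exact recall of random data has to break down past some length (all numbers below are made up for illustration):

    # made-up state size: 32 layers x 64 heads x 64x64 fp16 matrices
    state_bits = 32 * 64 * 64 * 64 * 16
    bits_per_token = 16           # random tokens from a 65,536-entry vocab

    # perfectly recalling n random tokens requires storing ~bits_per_token * n bits,
    # so even in the best case recall tops out around:
    print(state_bits // bits_per_token)   # ~8.4 million tokens -- generous, but finite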

Not just hard; it’s totally not possible at all.

That would mean you can scan an infinite amount of data perfectly in fixed time.

So… Hrm… this kind of claim rings some kind of alarm bells, when it’s combined with this kind of sweeping announcement.

It seems too good to be true; either it’s not that good, or the laws of the universe no longer hold.


(b) is the sacrifice made in these linear attention type architectures.

As a mitigation, you can leave a few normal attention layers in the model but replace the rest.
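
A toy sketch of that kind of layer mix (the counts and ratio here are made up, just to show the shape of the idea):

    # purely illustrative: interleave a few full-attention layers among linear ones
    def build_hybrid(n_layers=24, full_attn_every=8):
        layers = []
        for i in range(n_layers):
            if i % full_attn_every == full_attn_every - 1:
                layers.append("full_attention")    # quadratic, but only a few of these
            else:
                layers.append("linear_attention")  # constant-size state per token
        return layers

    print(build_hybrid())   # 21 linear-attention layers, 3 full-attention layers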


Perfect recall is often a function of the architecture allowing data to bleed through linkages. You can increase the perfect token recall through dilated WaveNet-style structures, or, in the case of v5, the use of multi-head linear attention creates multiple pathways where information can skip forward in time.


Here is a relevant tidbit from the RWKV paper "Limitations" section (https://arxiv.org/abs/2305.13048):

  First, the linear attention of RWKV leads to significant efficiency gains but still, it may also limit the model’s performance on tasks that require recalling minutiae information over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self-attention.


If later input overwrites previous input in the internal state, it means the model does have a limit to how much input it can "remember" at any given time and that limit is less than infinite.


You can think of it like your own memory. Can you remember a very important thing from 10 years ago? Can you remember every single thing since then? Some things will remain for basically infinite period, some will have a more limited scope.


I'm not sure I understand your concept of human memory.

It is pretty well established that very few people are able to remember details of things for any reasonable period of time. The way that we keep those memories is by recalling them and playing the events over again in our mind. This 'refreshes' them, but at the expense of 'corrupting' them. It is almost certain that things important to you that you are sure you remember correctly are wrong on many details -- you have at times gotten a bit hazy on some aspect, tried to recall it, 'figured it out', and stored that as your original memory without knowing it.

To me, 'concepts' like doing math or riding a bike are different. You don't really know how to ride a bike, in the sense that you couldn't explain the muscle movements needed to balance and move on a bicycle; when you get on it, you go through the process of figuring it out again. So even though you 'never forget how to ride a bike', you never really knew how to do it; you just got good at re-learning it incredibly quickly every time you tried.

Can you correct me on any misconceptions I may have about either how I think memories work, or how my thoughts should coincide with how these models work?


I was going more for an eli5 answer than making comparisons to specific brain concepts. The main idea was that the RNN keeps a rolling context, so there's no clear cutoff... I suspect if you tried, you could fine-tune this to remember some things better than others - some effectively forever, others would degrade the way you said.


There's a limit to the amount, but not to the duration (in theory). It can hold on to something it considers important for an arbitrary amount of time.


There’s a difference between the computation requirements of long context lengths and the accuracy of the model on long context length tasks.


In principle it has no context size limit, but (last time I checked) in practice there is one for implementation reasons.


From the HN Guidelines:

“Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.”

That user almost exclusively links to what appears to be their own product, which is self promotion. They also do it without clarifying their involvement, which could come across as astroturfing.

Self-promotion is sometimes (not all the time) fine, but it should be clearly stated as such. Doing it in a thread about a competing product is not ideal. If it came up naturally, that would be different from just interjecting a sales pitch.

I haven’t downvoted them, but I came close.


Ollama is built around llama.cpp, but it automatically handles templating the chat requests to the format each model expects, and it automatically loads and unloads models on demand based on which model an API client is requesting. Ollama also handles downloading and caching models (including quantized models), so you just request them by name.
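
For example, a single request like this is all it takes; ollama loads the requested model on demand and applies the right chat template for it (a minimal sketch, with the field names written from memory, so treat them as approximate):

    import requests   # ollama listens on localhost:11434 by default

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "mistral",   # requested by name; loaded on demand
            "messages": [{"role": "user", "content": "Why is the sky blue?"}],
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])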

Recently, it got better (though maybe not perfect yet) at calculating how many layers of any model will fit onto the GPU, letting you get the best performance without a bunch of tedious trial and error.

Similar to Dockerfiles, ollama offers Modelfiles that you can use to tweak the existing library of models (the parameters and such), or import gguf files directly if you find a model that isn’t in the library.

Ollama is the best way I’ve found to use LLMs locally. I’m not sure how well it would fare for multiuser scenarios, but there are probably better model servers for that anyways.

Running “make” on llama.cpp is really only the first step. It’s not comparable.


This is interesting. I wouldn't have given the project a deeper look without this information. The landing page is ambiguous. My immediate takeaway was, "Here's yet another front end promising ease of use."


I had similar feelings but last week finally tried it in WSL2.

Literally two shell commands and a largish download later, I was chatting with Mixtral on an aging 1070 at a positively surprising tokens/s (almost reading speed, kinda like the first ChatGPT). Felt like magic.


For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.

It's annoying that it seems to have its own model cache, but I can live with that.


vLLM doesn't support quantized models at this time so you need 2x 4090 to run Mixtral.

llama.cpp supports quantized models so that makes sense, ollama must have picked a quantized model to make it fit?


Eh? The docs say vLLM supports both GPTQ and AWQ quantization. Not that it matters now I'm out of the gate; it just surprised me that it didn't work.

I'm currently running nous-hermes2-mixtral:8x7b-dpo-q4_K_M with ollama, and it's offloaded 28 of 33 layers to the GPU with nothing else running on the card. Genuinely don't know whether it's better to go for a harsher quantisation or a smaller base model at this point - it's about 20 tokens per second but the latency is annoying.




> comes with a heavy runtime (node or python)

Ollama does not come with (or require) node or python. It is written in Go. If you are writing a node or python app, then the official clients being announced here could be useful, but they are not runtimes, and they are not required to use ollama. This very fundamental mistake in your message indicates to me that you haven’t researched ollama enough. If you’re going to criticize something, it is good to research it more.
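
For what it’s worth, the official Python client being announced is just a thin convenience wrapper over ollama’s existing HTTP API, not a runtime. Something like this (written from memory, so the exact return shape may differ slightly) is the whole story:

    import ollama   # the official Python client; a thin wrapper over the HTTP API

    reply = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
    )
    print(reply["message"]["content"])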

> does not expose the full capability of llama.cpp

As far as I’ve been able to tell, Ollama also exposes effectively everything llama.cpp offers. Maybe my use cases with llama.cpp weren’t advanced enough? Please feel free to list what is actually missing. Ollama allows you to deeply customize the parameters of models being served.

I already acknowledged that ollama was not a solution for every situation. For running on your own desktop, it is great. If you’re trying to deploy a multiuser LLM server, you probably want something else. If you’re trying to build a downloadable application, you probably want something else.


How much of a performance overhead does this runtime add, anyway? Each request to a model eats so much GPU time for the actual text generation that the cost of processing the request and response strings, even in a slow, garbage-collected language, seems negligible in terms of latency.


I just want a 4K Originals Plan with 1 screen.

Features:

    - Only Netflix Original content (in 4K)
    - 1 screen, since password sharing isn’t allowed anyways
    - No ads
    - No WWE
    - No smartphone games
    - No licensed content (which seems to be mostly 1080p anyways), since I can watch that elsewhere
I will offer $14/mo, at most.

Why should I be forced to pay for a bunch of stuff I don’t want or can’t use?

This is why I don’t have a Netflix subscription.

I subscribed for one month back in December to see how good/bad it was, and I was not impressed at all for $23/mo. The licensed content almost exclusively being 1080p just added insult to injury.


We're experiencing a re-bundling, you'll have to wait for the next unbundling.


It's insanity. I can't even watch 1080p content from most streaming providers on my desktop. It's always capped to 720p on Chrome/Win11. And even when a service offers 4K, the bitrate is abysmal. Trying to watch Dune via HBO Max, you can see color banding and blocking throughout the entire movie.


> It's always capped to 720p on Chrome/Win11.

Netflix appear to be testing 1080p in Chrome for Windows, and support it in Chrome for Mac now.


The best I can do is 5.




There was a beautiful period before the fragmentation happened where there was good content on Netflix, discovery was still solid, and competitors hadn't caught up yet so lots of studios were licensing to Netflix. That is when I paid for it. But now every network has their own streaming service for $20/m and we don't subscribe to any of them.


> There was a beautiful period before the fragmentation happened where there was good content on Netflix, discovery was still solid, and competitors hadn't caught up yet so lots of studios were licensing to Netflix. That is when I paid for it. But now every network has their own streaming service for $20/m and we don't subscribe to any of them.

"I am only willing to subscribe to you on terms where the company was having to borrow billions of dollars a year to stay afloat" does not make you very valuable as a customer, and you shouldn't be surprised nobody makes what you want on that basis.


True. That is not a good customer, but what if that customer is representative of most customers? All those billions of dollars “loss-leading” customers may not pay off in the end as the company tries to raise prices while providing less for more money.


> That is not a good customer, but what if that customer is representative of most customers?

Then streaming is unsustainable and we go back to cable as the streaming providers go bust.

But it's not representative of most customers. https://deadline.com/2024/01/netflix-earnings-q4-2023-stream...


I literally paid for a month of Netflix in December, as I mentioned in the comment you replied to.

I subscribed to Netflix for many years continuously before they massively inflated their prices. Yes, I would pay for it if their prices weren’t so offensive. Netflix thinks they deserve to charge far more than anyone else, while offering content that is on average less valuable to me than the competitors that I am subscribed to.

My comment was a clear statement of what I want and what I would be willing to pay for. Netflix doesn't have to offer me what I want, and I don't have to pay them money.

Your comment did not contribute anything helpful to this discussion. A dismissive generalization about an entire population segment (without any actual data) is not a good basis for anything.


Why not go for the 14/15 price option? Do you really need the best quality video for shows you judge to be average at best? Why do you need the best most expensive package? Stay within your budget.


Yes, I need the best quality that is available if I'm going to watch something. It's non-negotiable to me. That is a personal choice, like buying 4K Blu-rays of most things that I like. If Netflix would sell 4K blu-rays of all of their Originals, I would just buy those instead of (not) subscribing, but they do that for very little of their content. I always pay for the ad-free plans on these services for a similar reason: I don't find watching ads to be an acceptable use of time.


I don’t see clear licensing on any of this, including the synthetic dataset.


M1 was 68GB/s (and the article says this too).

M2 was 100GB/s.


You're right, my mistake.


The UniFi U7 Pro that I bought was ~$200. For that money, it upgraded my network from WiFi 6 to WiFi 7, so now the 6E devices that I have are noticeably faster (since I didn't have 6E before), and my network is future-proofed for WiFi 7. It seemed like an easy win for me, so I disagree with your assessment.

You also don't need multi-gigabit internet to have a NAS that you want to connect to over the local network.


You can get a 6E router for <$100. Sure, future-proofing may be a valid argument in some cases, but the vast majority of users simply do not need to spend that extra money.


I checked Amazon, and the cheapest one I saw was ~$140, not <$100. I saw WiFi 6 (not 6E) options for <$100.

Spending $200 now means that I don't have to spend $140 now and $140 again in two to three years.


I have never seen a WiFi 5 device benchmark anywhere close to 866Mbps in real world scenarios.

Most of the time, you're lucky to reach half of the "theoretical" numbers for WiFi due to the way that they're calculated. I would almost be amazed if you could consistently max out your 300Mbps internet connection over WiFi 5, but that's probably just within reach of WiFi 5, assuming there isn't much interference for WiFi 5 to handle.

EDIT: I checked on my old iPhone SE (1st gen), which is a WiFi 5 client device, and it was only able to achieve ~240Mbps down on the best run. Are you sure you're not bottlenecked by WiFi 5?


My iPhone SE 3rd gen is bottlenecked by my 500Mb fiber line in speed tests within my office room where the wifi AP is located. My AP is only a wifi 5 device.


Testing with a WiFi 5 client device (iPhone SE 1st gen), I'm seeing only 240Mbps down and 122Mbps up, which is far closer to what I would expect from WiFi 5.

I don't know how a WiFi 6 client (like your phone) might work around the limitations of a WiFi 5 access point, but I'm not seeing anywhere near 500Mbps out of WiFi 5 from a normal client device.


But in this case, I am still getting what I'd consider great performance out of my wifi 5 AP while using an almost 2 year old device. Moving to a wifi 6/6E/7 AP is not going to appreciably make my Internet experience on my phone any better or faster. I felt this was relevant because the article is about wifi 7 access points.


I agree your comment was relevant. The person I was responding to said they weren’t using WiFi 6 at all, which I interpreted to mean that all of their client devices were also still WiFi 5, but my interpretation could have been wrong.

If you can run docker somewhere in your network, you could consider running an OpenSpeedTest server and browse to that from your iPhone. This would let you remove your ISP from the equation and see just how fast your WiFi 5 connection can go.

I’m also mildly skeptical that your access point is only WiFi 5; it might actually support WiFi 6, since you’re already close to the maximum speeds I was seeing out of a WiFi 6 access point and WiFi 6 client device before I upgraded my network. Maybe WiFi 6 clients can do some serious magic with WiFi 5 access points, but it just seems… unlikely. But if you’re sure, then the numbers are what the numbers are. It just doesn’t make much sense to me.


My AP is a Mikrotik hAPac2: https://mikrotik.com/product/hap_ac2

Mikrotik claims it is a wifi 5 device on 5GHz bands. Qualcomm says the SOC used also only supports wifi 5 on 5GHz: https://www.qualcomm.com/products/internet-of-things/network...

Installing the OpenSpeedTest app on my iPhone and then running the test from my desktop which has wired Ethernet shows 646Mbps/472Mbps throughput.

Thanks for the tip about OpenSpeedTest! I have been curious to get a tool like this for other easy local speed testing! :)


gotcha, it’s cool that you’re able to get such good performance out of a WiFi 5 AP! And yep, OpenSpeedTest is nice!


>but I'm not seeing anywhere near 500Mbps out of WiFi 5 from a normal client device.

Not sure what you mean by "normal client device", but I've gotten approximately 500Mb/s on my laptop under optimal conditions (i.e. direct line of sight to the router and connecting to a wired device on the same LAN). The "top" speed for 2x2 at 80 MHz under wifi 5 is 867Mb/s[1], so that roughly tracks. Under more realistic conditions (e.g. a wall between router and device, or transferring files between two wireless devices), you'd expect less.

[1] https://en.wikipedia.org/wiki/IEEE_802.11ac-2013#Data_rates_...
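
For reference, that 867Mb/s figure falls straight out of the PHY parameters (back-of-the-envelope only; real-world throughput is always lower once protocol overhead and retries are counted):

    # 802.11ac, 80 MHz channel, 2 spatial streams, 256-QAM with 5/6 coding, short GI
    data_subcarriers = 234
    bits_per_symbol = 8 * 5 / 6        # per subcarrier, per stream
    symbol_time_us = 3.6               # 3.2us symbol + 0.4us short guard interval
    streams = 2

    phy_rate_mbps = data_subcarriers * bits_per_symbol * streams / symbol_time_us
    print(round(phy_rate_mbps, 1))     # ~866.7 -- actual throughput sits well below this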


Are you sure that the 2016 iPhone SE can do wide channels and multiple streams? Those are WiFi 5 features, but devices do not have to support them. High-end ones often do.


What do you stream to your iPhone that you're bottlenecked by 500 Mbps? Call of Duty ISOs?


As mentioned, I'm just doing a speed test to show the bottleneck is my fiber line and not the wifi, hence my wifi is giving significantly better performance than was expected by the commenter I replied to.


It depends on how new/expensive your router is, how strong the signal is/close you are, and how congested the channel is (at that moment). Some Wifi 5 (802.11ac) routers are significantly faster than the average.


When I used a MacBook Pro (13", 2015) with a Turris Omnia, my jaw dropped. I was getting gigabit speeds over WiFi. No wonder the MBP had no Ethernet.

It turned out that Apple at the time used 3x3 MIMO 802.11ac, and the Turris happened to be 3x3 as well.

With a more common setup (2x2 MIMO, 80 MHz wide channel), it is not a problem to achieve nearly 800 Mbps with WiFi 5. 300 Mbps is too low; that must be either a very noisy environment or a misconfigured AP (no MIMO, narrow channel).


What do you like about it? Compared to GPT-3.5, Claude Instant seems to be the same or worse in quality according to both human and automated benchmarks, but also more expensive. It seems undifferentiated. And I would rather use Mixtral than either of those in most cases, since Mixtral often outperforms GPT-3.5 and can be run on my own hardware.


Data extraction, mostly. It supports long documents, has cheaper input tokens than GPT-3.5 Turbo, and when I ask it to stick to the document's information, it doesn't try to fill in the gaps with its trained knowledge.

Sure, you can't have a chat with it or expect it to do high-level reasoning, but it has enough to do the basic deductions for grounded answers.


I'm not the person you're replying to, but that sentence comes from this research: https://arxiv.org/abs/2307.11760

