
What Andy giveth, Bill taketh away.[0]

I'm more than a little annoyed that so much data engineering is still done in Scala Spark or PySpark. Both suffer from pretty high memory overhead, which leads to suboptimal resource utilization. I've worked with a few different systems that compile their queries into C/C++ (which is transparent to the developer). Those tend to be significantly faster or can use fewer nodes to process.

I get that quick & dirty scripts for exploration don't need to be super optimized, and that throwing more hardware at the problem _can_ be cheaper than engineering time, but in my experience that mindset ends up costing my org tens of millions of dollars annually: just write some code and allocate a ton of resources to make it run in a reasonable amount of time.

I'm hopeful that Ballista[1], for example, will see uptake and improve this.

[0] https://en.wikipedia.org/wiki/Andy_and_Bill%27s_law

[1] https://github.com/apache/arrow-datafusion/tree/master/balli...



I get a kick out of stuff like this - I’m mostly an exec these days, but I recently prototyped a small database system to feed a business process in SQLite on my laptop.

To my amusement, my little SQLite prototype smoked the "enterprise" database. Turns out that a MacBook Pro SSD performs better than the SAN, and the query planner needs more TLC. We ended up running the queries off my laptop for a few days while the DBAs did their thing.


Right. Local storage is much more performant and cost-effective than network storage. I tried to run an IOPS-sensitive workload in the cloud, and it turned out I'd need to pay several thousand dollars per month for the performance I can get from a $100 NVMe SSD.


I'm working on a web app right now that does a lot of heavy/live/realtime processing with workers. The original thought was to run those workers on the node servers and stream the results over a socket, charging by the CPU/minute. But to my surprise, performance looks a lot better, up to about 16 workers, if the user just turns over a few cores to local web workers and runs it on their own box. As long as they don't need 64 cores or something, it's faster to run locally, even on an older machine. Thread transport is slow, but sockets are slower at a distance, and the bottlenecks are in the main thread anyway. So I've been making sure parts are interchangeable between web workers and "remote socket web workers" assigned to each instance of the app remotely.


Who the fuck thinks it's a good idea to run database on networked storage?


You might have heard of this little company in Seattle called “Amazon”.


Like the river? They're gonna need to fight hard for that SEO.


Isn't that what's happening if you use any managed database product? They have probably colocated everything as much as possible and used various proprietary tricks to cut latency, but still.


The same people who will run a clustered database on VMs.


What reminded me of this the other day is how macOS will grow your cursor if you "shake" it to help you find it on a big screen.

I was thinking about how they must have a routine that’s constantly taking mouse input, buffering history, and running some algorithm to determine when user input is a mouse “shake”.

And how many features like this add up to eat up a nontrivial amount of resources.


That particular example seems like something that's probably a lot cheaper than you'd initially think. The OS has to constantly take mouse input anyway to move the pointer and dispatch events to userspace. It also needs to record the current and new position of the mouse pointer to dispatch the events. Detecting whether the mouse is being "shaken" can be done with a ring buffer of mouse velocities over the last second or two of ticks. At 60 fps, that's about 120 ints = 480 bytes. Since you don't need to be precise, you can take the Manhattan distance (|x| + |y|) rather than the Euclidean distance (sqrt(x^2 + y^2)), which is a basically negligible computation. Add up the running total of the ring buffer - and you don't even need to visit each element, just keep a running total in a variable, add the new velocity, subtract the velocity that's about to be overwritten - and if this passes a threshold that's say 1-2 screen widths, the mouse is being "shaken" and the pointer should enlarge. In total you're looking at < 500 bytes and a few dozen CPU cycles per tick for this feature.
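
A minimal, self-contained sketch of that running-total ring buffer, in Python for illustration (the tick count and threshold are made-up numbers, and each mouse event is treated as one tick for simplicity):

    # Hypothetical shake detector: ring buffer of per-tick Manhattan distances
    # plus a running total, as described above.
    class ShakeDetector:
        def __init__(self, ticks=120, threshold=5000):
            self.buf = [0] * ticks      # distances for the last `ticks` ticks
            self.i = 0                  # index of the oldest entry
            self.total = 0              # running sum of the buffer
            self.threshold = threshold  # made up; roughly a couple of screen widths in pixels

        def on_mouse_move(self, dx, dy):
            m = abs(dx) + abs(dy)               # Manhattan distance this tick
            self.total += m - self.buf[self.i]  # add the new value, drop the oldest
            self.buf[self.i] = m
            self.i = (self.i + 1) % len(self.buf)
            return self.total > self.threshold  # "shaking" if recent travel is large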


Or, alternatively, the engineer who worked on this at Apple has just read the above, seen another way of solving the problem, and is throwing it on their backlog for tomorrow.


Thanks for the thoughtful analysis and napkin math. You may very well be right. I wonder if this is true in practice or if they suffer from any interface abstractions and whatnot.


On every modern (past few decades) platform, the mouse cursor is a hardware sprite with a dedicated, optimized, *fast* path through every layer of the stack, just to shave off a few ms of user-perceived latency. Grab a window and shake it violently and you'll notice it lags a few pixels behind the cursor - that's the magic in action.

In some places there's no room left for unnecessary abstractions, I can imagine most of the code touching mouse / cursor handling is in that category.


I wish this were true on Ubuntu, but every time I boot up is a dice roll on whether my mouse will be sluggish and low-FPS or not.


If you shake the cursor up and down, it doesn't grow.

Looks like it's measuring something like the number of direction changes relative to distance traveled, ignoring the y axis completely.


That's some actual engineering there, understanding the problem you're trying to solve, instead of solving problems you don't even have. Nice.


Why not use an exponential filter that only uses a single variable?


The running-sum-difference approach suggested above is a box filter, which has the best possible noise suppression for a given step-function delay, although in the frequency domain it looks appalling. It uses more RAM, but not that much. The single-pole RC filter you're suggesting is much nicer in the frequency domain, but in the time domain it's far worse.
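
For reference, and assuming the standard textbook definitions (not something spelled out above): with the box filter averaging the last N samples and the single-pole filter defined by y[n] = (1-a) y[n-1] + a x[n], the magnitude responses are

    |H_{\mathrm{box}}(e^{j\omega})| = \left|\frac{\sin(N\omega/2)}{N\,\sin(\omega/2)}\right|,
    \qquad
    |H_{\mathrm{1pole}}(e^{j\omega})| = \frac{\alpha}{\sqrt{1 - 2(1-\alpha)\cos\omega + (1-\alpha)^{2}}}.

The sinc-like sidelobes of the first are the "appalling in the frequency domain" part; the second rolls off smoothly.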


Is it "far worse in the time domain" due to its infinite impulse response?


Not really? Sort of? I don't really have a good answer here. It depends on what you mean by "due to". It's certainly due to the impulse response, since in the sense I meant "far worse" the impulse response is the only thing that matters.

Truncating the impulse response after five time constants wouldn't really change its output noticeably, and even if you truncated it after two or three time constants it would still be inferior to the box filter for this application, though less bad. So in that sense the problem isn't that it's infinite.

Likewise, you could certainly design a direct-form IIR filter that did a perfectly adequate job of approximating a box filter for this sort of application, and that might actually be a reasonable thing to do if you wanted to do something like this with a bunch of op-amps or microwave passives instead of code.

So the fact that the impulse response is infinite is neither necessary nor sufficient for the problem.

The problem with the simple single-pole filter is that by putting so much weight on very recent samples, you sort of throw away some information about samples that aren't quite so recent and become more vulnerable to false triggering from a single rapid mouse movement, so you have to set the threshold higher to compensate.
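
To make the weighting concrete (standard textbook forms, not something stated above): the impulse responses are

    h_{\mathrm{box}}[k] = \frac{1}{N} \quad (0 \le k < N),
    \qquad
    h_{\mathrm{1pole}}[k] = \alpha\,(1-\alpha)^{k} \quad (k \ge 0),

so the box filter weights the last N samples equally, while the single-pole filter concentrates its weight on the most recent few samples and geometrically discounts older ones, which is why a single fast flick can push it over a threshold the box filter would ignore.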


Reading all of you sounding super smart and saying stuff I don’t recognize (but perhaps utilize without knowing the terms) used to make me feel anxious about being an impostor. Now it makes me excited that there’s so many more secrets to discover in my discipline.


Yeah, DSP has a lot of really awesome stuff in it! I recommend https://www.dspguide.com/.

It turns out that pretty much any time you have code that interacts with the world outside computers, you end up doing DSP. Graphics processing algorithms are DSP; software-defined radio is DSP; music synthesis is DSP; Kalman filters for position estimation are DSP; PID controllers for thermostats or motor control are DSP; converting sonar echoes into images is DSP; electrocardiogram analysis is DSP; high-frequency trading is DSP (though most of the linear theory is not useful there). So if you're interested in programming and also interested in graphics, sound, communication, or other things outside of computers, you will appreciate having studied DSP.


Don't worry, this is a domain of magic Matlab functions, Excel data analysis, and multiply-named terms for the same thing (separately invented about four times on average in different fields), with incomprehensible jargon and no simple explanation untainted by very specific industry applications.


Agreed, a single-pole IIR filter would be cheaper and have better frequency-domain behavior.


Please elaborate.


For both alternatives we begin by computing how far the mouse has gone:

    int m = abs(dx) + abs(dy);   // Manhattan distance
For the single-pole RC exponential filter as WanderPanda suggested:

    c -= c >> 5;                 // exponential decay without a multiply (not actually faster on most modern CPUs)
    c += m;
For the box filter with the running-sum table as nostrademons suggested:

    s += m;                      // update running sum
    size_t j = (i + 1) % n;      // calculate index in prefix sum table to overwrite
    int d = s - t[j];            // calculate sum of last n mouse movement Manhattan distances
    t[j] = s;
    i = j;
Here c, i, s, and t are all presumed to persist from one event to the next, so maybe they're part of some context struct, while in old-fashioned C they'd be static variables. If n is a compile-time constant, this will be more efficient, especially if it's a power of 2. You don't really need a separate persistent s; that's an optimization nostrademons suggested, but you could instead use a local s at the cost of an extra array-indexing operation:

    int s = t[i] + m;
Depending on context this might not actually cost any extra time.

Once you've computed your smoothed mouse velocity in c or d, you compare it against some kind of predetermined threshold, or maybe apply a smoothstep to it to get the mouse pointer size.

Roughly I think WanderPanda's approach is about 12 RISCish CPU instructions, and nostrademons's approach is about 18 but works a lot better. Either way you're probably looking at about 4-8 clock cycles on one core per mouse movement, considerably less than actually drawing the mouse pointer (if you're doing it on the CPU, anyway).

Does that help?


> they must have a routine that’s constantly taking mouse input

Possible but unlikely. Well-written desktop software is never constantly polling for input; it sleeps on OS kernel primitives like poll/epoll/IOCP/etc., waiting for those inputs.

Operating systems don't generate mouse events at 1kHz unless you actually move the mouse.


“Constantly taking” is not the same thing as “constantly polling”. The ring buffer approach works identically in the event-driven case, you just need to calculate the number of “skipped” ticks and zero them out in the ring buffer.
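
E.g., continuing the hypothetical ShakeDetector sketch from a sibling comment above (the tick length and field names are my own invention):

    # Event-driven variant: advance the ring buffer by however many ticks
    # elapsed since the last event, writing zeros for the idle ticks, then
    # record the current movement as usual.
    def on_mouse_event(det, dx, dy, now_ms, last_ms, tick_ms=16):
        skipped = max(0, (now_ms - last_ms) // tick_ms - 1)
        for _ in range(min(skipped, len(det.buf))):   # zero out the idle ticks
            det.total -= det.buf[det.i]
            det.buf[det.i] = 0
            det.i = (det.i + 1) % len(det.buf)
        return det.on_mouse_move(dx, dy)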


Simply moving my USB mouse consumes 10% of my CPU. Computers these days...


Back in the day, moving your mouse made things look like they were processing faster.


Not just _look like_ — on Windows 95, moving the mouse _actually_ made things process faster, for fairly bizarre reasons.

Source: https://retrocomputing.stackexchange.com/questions/11533/why...


What's behind the mouse cursor while you're doing it? Could it be the UX/UI layer keeping state up-to-date?

Other possibility, do you have a gaming mouse with 1000Hz polling rate configured?


Yeah, it's a high quality mouse. But the only excuse for this is it's slightly cheaper to make everything USB. PS/2 worked much better. It was limited to 200Hz but needed no polling. Motherboards just stopped providing the port.


If the computer has to do anything at all, it adds complexity and it isn't doing other things. One could do something a bit like blue-screening and add the mouse pointer to the video signal in the monitor. For basic functionality the computer only needs to know the x and y of clicks (it could, for laughs, also report the colors in the area). Hover and other effects could be activated when [really] needed. As a bonus, one could replace the hideous monitor configuration menus with a point-and-click interface.


This polling is not done by the CPU; that's a common misconception. In a typical modern system the only polling that happens with USB is done by the USB host controller, and only when there is actual data does the host controller generate interrupts for the CPU to process it. Obviously, when you configure the mouse at a higher frequency you will get more interrupts and hence higher CPU usage, but that has nothing to do with the polling.


Gaming or off-the-shelf prosumer mobos still have PS/2 on them occasionally. Although it's probably just using a converter to USB anyway.


Your report rate may be too high.


What does that mean?



And yet macOS doesn't let you change the cursor color. On my Windows 10 desktop I set the cursor to a slightly larger size and a yellow color. So much easier to work with.


That seems like an incredibly cheap feature. The mouse pointer PNG probably costs a lot more than the shake detector.


Abstractions almost always end up leaky. Spark SQL, for example, does whole-stage codegen which collapses multiple project/filter stages into a single compiled stage, but your underlying data format still needs to be memory friendly (i.e. linear accesses, low branching, etc.). The codegen is very naive and the JVM JIT can only do so much.
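
For what it's worth, you can inspect what whole-stage codegen actually produces from PySpark (3.x) with the codegen explain mode; a small sketch with made-up columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(10_000_000).selectExpr("id", "id % 7 AS k")
    out = df.filter("k = 3").groupBy("k").count()

    # Prints the generated Java source for the collapsed (whole-stage codegen) stages
    out.explain(mode="codegen")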

What I've seen is that you need people who deeply understand the system (e.g. Spark) to be able to tune for these edge cases (e.g. see [1] for examples of some of the tradeoffs between different processing schemes). Those people are expensive (think $500k+ annual salaries) and are really only cost effective when your compute spend is in the tens of millions or higher annually. Everyone else is using open source and throwing more compute at the problem or relying on their data scientists/data engineers to figure out what magic knob to turn.

[1]: https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf


What's wrong with relying on data engineers for data engineering?


Spark is very, very odd to tune. Like, it seems (from my limited experience) to have the problems common to distributed data processing (skew; it's almost always skew), but because it's lazy, people end up really confused about what actually drives the performance problems.
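
For the skew case specifically, the usual workaround is key salting; a rough PySpark sketch with every name hypothetical (and newer Spark versions can apparently do something similar automatically via adaptive skew-join handling):

    from pyspark.sql import functions as F

    SALTS = 16  # how many buckets to split each hot key into

    # Spread the skewed (big) side across SALTS sub-keys...
    big_salted = big_df.withColumn("salt", (F.rand() * SALTS).cast("int"))

    # ...and replicate the small side once per salt value so the join still matches.
    salt_values = spark.range(SALTS).select(F.col("id").cast("int").alias("salt"))
    small_salted = small_df.crossJoin(salt_values)

    joined = big_salted.join(small_salted, on=["key", "salt"]).drop("salt")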

That being said, Spark is literally the only (relatively) easy way to run distributed ML that's open source. The competitors are GPUs (if you have a GPU-friendly problem) and running multiple Python processes across the network.

(I'm really hoping that people will now school me, and I'll discover a much better way in the comments).


Data engineers should be building pipelines and delivering business value, not fiddling with some JVM or Spark parameter that saves them runtime on a join (or, for that matter, from what I've seen at a certain bigco, building their own custom join algorithms). That's why I said it's only economical for big companies to run efficient abstractions, and everyone else just throws more compute at the issue.


I work in the analytics space, mostly on Java, and I am so glad that other people feel the same. At this point, people have become afraid of suggesting something other than Spark. I could see something written in Rust being much better at problems like this. I love the JVM, but it works well with transactional workloads and starts showing its age when it's dealing with analytical loads. The worst thing is when people start doing weak references and weird off-the-heap processing, usually by a senior engineer, which really defeats the purpose of the JVM.


I guess your company is running on Java and running something else would cost a lot in training, recruiting, understanding, etc. But down the line, defeating the JVM will be understood only by the guy who did it... Then that guy will leave... Then the newcomers will rewrite the thing in Spark 'cos it will feel safer. Rinse-repeat...

(I'm totally speculating, but your story seems so true that it inspired me :-)


Some of it is what you mentioned about training and hiring costs, but mostly it's the creation of a narrative that it will scale someday in the future. This is usually done by that engineer (or engineers), and they are good at selling, so a different opinion is ignored or frowned upon by the leadership.

I have now seen this anti-pattern in multiple places.


> I love the JVM but it works well with transactional workloads and starts showing its age when its dealing with analytical loads.

This is interesting. Can you elaborate a bit?


Analytical loads deal with very large datasets, on the order of terabytes even after you compress them. These workloads don't change much, so keeping them in the heap eventually results in long processing pauses because the JVM tries to recover this memory. However, in most cases this data is not meant to be garbage collected. For transactions, once you have persisted the data it should be garbage collected, so the pattern works. There are a lot of other aspects that I can probably think of, but the above one is the most important in my mind.
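
Spark itself exposes off-heap memory for roughly this reason; a hedged PySpark sketch (the size is arbitrary):

    from pyspark.sql import SparkSession

    # Put execution/storage memory outside the Java heap so the GC
    # doesn't have to walk the big, long-lived buffers.
    spark = (SparkSession.builder
             .config("spark.memory.offHeap.enabled", "true")
             .config("spark.memory.offHeap.size", "16g")
             .getOrCreate())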


Yes, the whole idea of sending “agents” to do processing performs poorly, and things like Snowflake and Trino, where queries go to already-deployed code, run rings around it.

Furthermore, PySpark is by far the most popular and most-used Spark, and it also has the absolute world-worst, atrocious mechanical sympathy. Why?

Developer velocity trumps compute velocity any day?

(I want the niceness of Python and the performance of e.g. Firebolt. Why must I pick?)

(There is a general effort to get Spark “off heap” and use generic query compute in the Spark SQL space, but it is miles behind the systems that start off there.)


Could you elaborate on other systems besides Ballista? (which looks great btw, thank you for sharing)


U-SQL on Azure Data Lake Analytics[0] and ECL from HPCC Systems[1] are two I have fairly extensive experience with. There may be other such platforms, but those and Ballista are the three on my radar.

[0] https://github.com/MicrosoftDocs/azure-docs/blob/master/arti...

[1] https://wiki.hpccsystems.com/pages/viewpage.action?pageId=28...


ECL is the devil. There's so little documentation on it that you basically need to work for LNRS to actually get any experience with the damn thing.

If it had an SQL layer, I'd spend time evangelising it, but it's not worth learning another query language for.

There exists a world where it got open-sourced before Hadoop was built, and in that world it's probably everywhere.



