Hacker News | psykotic's comments

> Most self taughts lack broader concepts, abstractions and especially algorithms.

As someone self-taught who later went to university for pure mathematics (which I got interested in through game programming), I can say that was not my experience with the late-90s/early-2000s generation of self-taught game programmers. I think you have to go back 5 to 10 years earlier for that stereotype to have merit; the transition from 2D to 3D filtered out some people.

In the post-Quake generation, the sexy thing for budding game programmers was 3D graphics programming. For the most part you had no choice [1] but to read, understand and implement SIGGRAPH papers and doctoral dissertations to learn the ropes, which in turn forced us to learn a lot of the formal prerequisites, albeit unevenly. And there were a lot of advanced data structures involved in visibility determination (and later level of detail), which was the focal point of graphics in the Quake 1 and 2 era before it fell by the wayside once the dominance of GPUs made more brute-force approaches the right choice. So whether you liked it or not, you had to master a decent swath of university-level CS and math.

The main things self-taughts generally lacked when joining the industry were the same things all new employees lack: they don't yet know how to work in teams, they don't know how to manage time, scope and complexity (though it helps if you've worked on larger personal projects of your own), they can't make long-term engineering trade-offs (reliability, cost of making changes early vs late, flexibility vs performance, etc.) because they haven't gone through full project cycles, and so on.

[1] There were a few useful resources like Abrash's Dr. Dobb's articles on Quake, which were partially pre-digested for easier consumption by programmers, but it was still daunting if you didn't understand the math. And the rate of progress in this period was so high that the few reliable sources of how-to information quickly went out of date, so you _had_ to keep up with the academic research and experiment by implementing and improving it yourself.


LTO for LLVM/clang and gcc is implemented by having the compiler emit its internal IR rather than machine code into the object files. The linker's job is to call back into the compiler at link time with the serialized IR from the object files to produce machine code; the linker does not do the link-time optimization itself. LTO support in a linker is therefore a pretty binary feature (does it support compiler X on platform Y?) without much of a "good/better" gradation. And when it comes to that, mold implements LTO support for both gcc and LLVM on Unix-like systems.


Let's go to primary sources: https://developer.arm.com/documentation/ddi0487/latest

The author is right. ARMv8 supports relaxed memory ordering in general, but its memory model does not offer acquire-release ordering any weaker than sequential consistency. (Update: support for weaker acquire-release ordering was added in a later revision.)

I'll quote a relevant excerpt from B2.3.11:

    Where a Load-Acquire appears in program order after a Store-Release, the memory access generated by the Store-Release instruction is Observed-by each PE to the extent that PE is required to observe the access coherently, before the memory access generated by the Load-Acquire instruction is Observed-by that PE, to the extent that the PE is required to observe the access coherently. 
Sequential consistency is needed to avoid store/load reordering in cases like this:

    // Thread 1
    flag1.store(1, SeqCst)
    if flag2.load(SeqCst) == 0 {
        // Guarded action
    }

    // Thread 2
    flag2.store(1, SeqCst)
    if flag1.load(SeqCst) == 0 {
        // Guarded action
    }
If these were instead implemented with acquire/release ordering as defined by the C++ or Rust memory model, the resulting happens-before constraints would not prevent both threads from executing their guarded actions.
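
For concreteness, here is the acquire/release variant of the pattern above as a Rust sketch; nothing forces the store in each thread to become visible before that thread's load of the other flag, so both loads can return 0:

    use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release}};

    static FLAG1: AtomicU32 = AtomicU32::new(0);
    static FLAG2: AtomicU32 = AtomicU32::new(0);

    // Thread 1: a Release store followed by an Acquire load of a *different*
    // location may be reordered, so this thread can read 0 even though
    // thread 2 has already stored 1 (and vice versa).
    fn thread1() {
        FLAG1.store(1, Release);
        if FLAG2.load(Acquire) == 0 {
            // Guarded action (not mutually exclusive with thread 2's)
        }
    }

    // Thread 2: symmetric.
    fn thread2() {
        FLAG2.store(1, Release);
        if FLAG1.load(Acquire) == 0 {
            // Guarded action
        }
    }

    fn main() {
        let t1 = std::thread::spawn(thread1);
        let t2 = std::thread::spawn(thread2);
        t1.join().unwrap();
        t2.join().unwrap();
    }
Replace Release/Acquire with SeqCst on all four operations and the single total order over SeqCst operations rules that outcome out.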

The excerpt from the Architecture Reference Manual says that if you use their load-acquire (ldar) and store-release (stlr) instructions, it is not possible for the store-release to be moved after the load-acquire, as observed by PEs (processing elements, their abstraction of hardware threads).

Let's look at how C++ compilers implement acquire-release vs sequential consistency on x86 and ARMv8:

https://godbolt.org/z/3fd5jse18

The machine code on ARMv8 is identical for thread_acq_rel and thread_seq_cst, whereas on x86 the thread_seq_cst code has to use xchg (an alternative to store + mfence) to achieve sequential consistency.

Update: shachaf pointed out that ARM added support for weaker acquire-release semantics in the ARMv8.3 revision. It looks like the first processor to ship with ARMv8.3 support was Apple's A12X in 2018, which is 5 years after Herb's talk. If you take the code from before and compile for ARMv8 with all architectural features enabled, you will see different machine code for thread_acq_rel, which uses the newer ldaprb instruction:

https://godbolt.org/z/dnP9sebcz

This illustrates a difficulty with talking about "ARMv8" as a fixed thing. It's much more of a rapidly moving target than x86. That said, the ARMv8.3 addendum should have been mentioned, at least parenthetically; I emailed the author suggesting an info box.


> If you want to refactor, it's up to you to delete the types.

The fact that

    f(g(x))
and

    { let y = g(x); f(y) }
are equally robust to changes in g(x)'s type is an advantage of type inference. It doesn't seem reasonable to place an additional burden on the person who chose to assign a name to the subexpression value.

> Personally I really like the spot readability of explicit types.

You can use an IDE that displays inferred types as type inlays (for variables) and as sidebar info (for expressions). But I've often wanted a way to export such sideband data (including subexpression types and cross-reference links) so you could display it for code reviews, etc, without teaching those tools about your language specifically, so I'm sympathetic to the general goal.

Regarding explicitness and readability, I like Rust's idea of letting you partially specify types by using _ as a placeholder to say "infer this part". For example, you can specify that something is a Vec without specifying its item type:

    let v: Vec<_> = iter.collect();


Explicit types are a form of assertion. In the example

    { let y = g(x); f(y) }
I do not want to assert anything about the type of y, but in

    {
        let y: Foo = g(x);
        // Do complex stuff with y
        f(y)
    }
I wrote the middle part with a specific mental model of what y is, and if that changes, I want to review that section to make sure that the semantics still match what I thought they were and that the code doesn't just happen to remain syntactically valid.

I do not want an automated tool to strip assertions from my code, nor do I want it to synthesize assertions from a static analysis run of a particular version of the code. If I'm just gonna mindlessly remove all assertions and re-synthesize them on every refactor, they're useless.


> You can use an IDE that displays inferred types as type inlays

This isn't captured in the code & that is a huge weakness. Being able to see the code change over time is, in my view, a huge advantage.

Reliance on high-tech tools to discern meaning & do the inference for us feels far weaker than simply encoding the good information into the code. That just seems far better to me, far more robust. Some of this is that I'm a vim user and don't want to have to go look up & configure an LSP daemon for each file I evaluate, but some of it just feels like base common sense: the tradeoffs of actually being explicit are overwhelming & clear.

I think I argued against this pretty resolutely the first time around. I don't think your post adds anything or changes my mind at all.


> It's possible to build a hash on tolerant doubles by making two hashes per lookup, and even complex numbers with four per lookup, but no one's figured out how to deal with entries that contain multiple numbers yet.

I'm curious what obstacle you have in mind for "multiple numbers".

The problem is equivalent to a distance query on a data structure, i.e. enumerate everything within a distance tol of a query point x in an appropriate distance metric. There are many data structures for this problem (it's a basic building block in computational geometry). But I assume the hashing solution you have in mind is what we usually call a hash grid in gamedev and graphics, which works well when tol is fixed for the lifetime of the data structure. [1] Namely, you define a grid with cell size tol and map a point x to the index cell(x) of its containing cell, which is just a rounding operation. Then you can use a hash table keyed by cell(x) to store the contents of an unbounded, sparse grid. To perform a distance query from x you need to check not just x's cell but some of the immediately neighboring cells and then filter all of those candidates based on the pairwise distance. [2] [3]
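
A minimal sketch of that scheme in Rust for 2D points with a fixed tol (cell size equal to tol; for simplicity it checks the full 3x3 block of neighboring cells rather than the minimal subset):

    use std::collections::HashMap;

    /// Sparse 2D hash grid with cell size = tol: each point is stored in the
    /// cell containing it, and a distance query checks the 3x3 block of cells
    /// around the query point, then filters the candidates by actual distance.
    struct HashGrid {
        tol: f64,
        cells: HashMap<(i64, i64), Vec<[f64; 2]>>,
    }

    impl HashGrid {
        fn new(tol: f64) -> Self {
            HashGrid { tol, cells: HashMap::new() }
        }

        // cell(p) is just a rounding operation: which grid cell contains p?
        fn cell(&self, p: [f64; 2]) -> (i64, i64) {
            ((p[0] / self.tol).floor() as i64, (p[1] / self.tol).floor() as i64)
        }

        fn insert(&mut self, p: [f64; 2]) {
            let key = self.cell(p);
            self.cells.entry(key).or_default().push(p);
        }

        /// Enumerate all stored points within Euclidean distance tol of q.
        fn query(&self, q: [f64; 2]) -> Vec<[f64; 2]> {
            let (cx, cy) = self.cell(q);
            let mut hits = Vec::new();
            for dx in -1..=1 {
                for dy in -1..=1 {
                    if let Some(points) = self.cells.get(&(cx + dx, cy + dy)) {
                        for &p in points {
                            let (ex, ey) = (p[0] - q[0], p[1] - q[1]);
                            if ex * ex + ey * ey <= self.tol * self.tol {
                                hits.push(p);
                            }
                        }
                    }
                }
            }
            hits
        }
    }
With a fixed tol this is O(1) expected work per insert and per query for reasonably distributed points, and it generalizes to d dimensions by keying the hash on a d-tuple of cell coordinates.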

This approach works with any distance metric (including the usual suspects L^2, L^1 and L^inf) and in any dimension d, although the number of cells to check grows exponentially with d (2^d if the cell size is at least twice the query radius, 3^d when they are equal), so the curse of dimensionality is present and in high-dimensional spaces another data structure not based on rectilinear partitioning would be preferable. But hashing does work with "multiple numbers" if by "multiple numbers" you mean a vector (with not too many components) where tolerance is defined by a distance metric.

[1] Actually, hash grids work well any time the query radius is on roughly the same scale as the cell size. But if the query radius is a lot smaller than the cell size then the grid is too coarse to perform adequate pre-filtering. And if the query radius is larger than the cell size you have to check a bigger neighborhood of cells (i.e. more hash lookups) to enumerate all the candidates.

[2] Depending on the read vs write ratio, you can actually flip this around by storing each point in multiple cells so that queries only need one hash table lookup at the expense of slightly less precise pre-filtering.

[3] Instead of doing multiple hash table lookups, you can also have each occupied cell store pointers to neighboring cells (null for unoccupied neighbors), which replaces hash table lookups for the neighbors with plain memory loads. A variation on this trick is to instead store bit flags to avoid doing hash table lookups for neighboring cells which are known to be empty; this takes up far less memory.


Their financials are accessible now that they're a publicly traded company. https://investors.unity.com/news/news-details/2022/Unity-Ann...


> The 2021 full-year GAAP results were impacted by […] a charge of $49.8 million related to the termination of a lease agreement.

oof


Not sure what time scale you had in mind with "many many years", but the commercial third-party extension Visual Assist has been the only serious option for robust and scalable C++ code navigation (and related features) in Visual Studio from the late 1990s/early 2000s to the present day. Visual Studio's built-in IntelliSense has been trying to play catch-up (and has somewhat closed the gap), not the other way around. There were also widely used products like Source Insight focusing on offline whole-program code analysis and navigation, but they weren't really designed for integrated development.

That said, your thesis about the relative difficulty of doing this for C++ vs Java seems hard to argue against, and the rapid, continuous growth in C++'s surface area and complexity since C++11 surely hasn't helped.


I've only profiled fd on Windows but one thing that stood out was that it performed 2 stat syscalls per file via NtQueryInformationFile (that number should be 0 per file on Windows since stat metadata comes for free with NtQueryDirectoryFile from the directory enumeration). When I mentioned this finding on Twitter, someone confirmed that it's also doubling up the stat syscalls on Linux. But if the OP is actually trying to benchmark raw directory enumeration speed vs git ls-files, they should make sure they're benchmarking against something that's not making per-file stat calls at all.


> But if the OP is actually trying to benchmark raw directory enumeration speed vs git ls-files, they should make sure they're benchmarking against something that's not making per-file stat calls at all.

I think OP is trying to benchmark which tool is fastest/most efficient for his workflow. If one of the tools has bugs (or intentional but unnecessary behavior) that slow it down, it's great if they get fixed, but that doesn't help if they don't.

I do think it's going to be pretty hard for a directory walker to do better than a pre-made index listing of the files, though, even with a warm filesystem cache.


This is true and a fair point, but the article under discussion even has a subtitle: "Git ls-files is 5 times faster than fd or find, but why?"

My answer is "at least partly because `fd` & `find` are both slow - for different reasons". You are never going to do better than reading a saved answer, but I only get a 1.8x hit for not having an index { which needs maintenance as has been pointed out by almost everyone :-) }. `walk`, linked elsewhere, is less of a strawman comparison but could probably be optimized a bit more.


So, when I do

    linux.git$ strace -fc fd -j1 --no-ignore --hidden --exclude .git --type file --type symlink>/dev/null
    strace: Process 24100 attached
    strace: Process 24101 attached
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ------------------
     61.34    0.382389       21243        18         2 futex
     26.50    0.165192           2     74307           write
      4.72    0.029454           3      9754           getdents64
      2.22    0.013846           2      4884           openat
      2.03    0.012677           2      4886           close
      2.02    0.012566           2      4884           newfstatat
      0.96    0.005960           5      1002           clock_gettime
      ...
     ------ ----------- ----------- --------- --------- ------------------
    100.00    0.623428           6     99900         6 total
I see numbers suggesting that the fd slowdown (on Linux, anyway) may be `fd` doing unbuffered writes to stdout (probably to avoid interleaving/mixing of multi-threaded output) rather than anything about the way the file tree recursion is written. The fix is probably for `fd -j1` to do buffered output (and ideally never clone/fork). (The multi-threading may be unproductive in a different way than my initial futex guess.)
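
To be concrete about what I mean by buffered output, here is a minimal Rust sketch (not fd's actual code) of the usual wrap-stdout-in-a-buffer pattern, which turns one write(2) per path into one per ~8 KiB of output:

    use std::io::{self, BufWriter, Write};

    fn main() -> io::Result<()> {
        // Lock stdout once and wrap it in a BufWriter so each writeln! appends
        // to an in-memory buffer instead of issuing its own write(2) syscall.
        let stdout = io::stdout();
        let mut out = BufWriter::new(stdout.lock());
        for path in ["src/main.rs", "src/lib.rs"] { // stand-ins for walked paths
            writeln!(out, "{}", path)?;
        }
        out.flush() // drain whatever is left in the buffer with one last write
    }
With multiple worker threads you would give each thread its own buffer (or funnel paths through a channel to a single writer) so flushes cannot interleave mid-line.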

My guess (without looking at the code - honestly Rust kind of hurts my eyeballs) is that this unbuffered mode is not even really guaranteed/sound anyway. If you pipe the output, e.g. `fd -jN|cat>out` with `N>1`, and any deep-hierarchy pathname output is bigger than PIPE_BUF, you hit trouble. Once those pipe writes lose atomicity, threads can be put to sleep and re-awakened in various orders and wind up interleaving their pipe writes, with or without userspace-level buffering.

(filename max)*(open fd default limit)=255*1024 =~ 4*PIPE_BUF. So the "full path print out" may still be capable of being > PIPE_BUF and so still interleave producing non-existent pathnames (which would be so long they would need special care like chdir + chdir... + stat to even verify non-existence). You would very likely have to synthesize a file tree to trigger this failure mode, of course.

I say all the above in terms of a robust tool. In terms of benchmarking walks, it is much easier to just do something like `dents dstats` and use the walk to produce a depth histogram instead of mega/gigabytes of path output. If you have some burning need to be multi-threaded, it is even easy to histogram in parallel and then merge at the end, and the histograms should almost always fit in private L2 CPU caches. It could always be a total count instead of a depth histogram. AFAIK, GNU find has no `-print`-like `-count` operator to compare against, but I'd bet the C internals make such a thing not so hard to add. (Yes, you could use `-exec..+`, but this might be so slow as to make the benchmark questionable.)
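
Something like this sketch (Rust, not `dents` itself) is what I mean: count entries per depth instead of printing paths, so the output cost drops out of the measurement entirely:

    use std::fs;
    use std::path::Path;

    // Recursively walk `dir`, incrementing histo[depth] for every entry seen.
    // Errors (permissions, races with deletions) are silently skipped.
    fn walk(dir: &Path, depth: usize, histo: &mut Vec<u64>) {
        if histo.len() <= depth {
            histo.resize(depth + 1, 0);
        }
        let Ok(entries) = fs::read_dir(dir) else { return };
        for entry in entries.flatten() {
            histo[depth] += 1;
            // file_type() usually comes from the directory entry itself
            // (d_type), so it does not normally cost an extra stat per file.
            if entry.file_type().map(|t| t.is_dir()).unwrap_or(false) {
                walk(&entry.path(), depth + 1, histo);
            }
        }
    }

    fn main() {
        let mut histo = Vec::new();
        walk(Path::new("."), 0, &mut histo);
        for (depth, count) in histo.iter().enumerate() {
            println!("depth {depth}: {count} entries");
        }
    }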

Parenthetically, I wonder if any stdlib in any prog.lang does the "chdir loop + stat|other" type of work for users with such pathologically long pathnames for all the file system calls. Seems unlikely, but not impossible, and easy to know when it is needed once a "string knows its length". Food for thought.


A space-aware syntax can allow optional spacing around operators but make it an error if the spacing does not respect the precedence hierarchy or is deemed inconsistent by other rules.


I believe Fortress did that.


> Even full-fledged database systems don't typically implement B+ tree indexes with deletion+rebalancing support (the usual technique is marking nodes deleted and cleaning them up when the index is rebuilt).

Yes, and it took a surprisingly long time before anyone bothered to do the theoretical analysis. Some good papers to check out if anyone is interested are Deletion Without Rebalancing in Multiway Search Trees (http://sidsen.azurewebsites.net/papers/relaxed-b-tree-tods.p...) and Deletion Without Rebalancing in Binary Search Trees (http://sidsen.azurewebsites.net//papers/ravl-trees-journal.p...). Jeff Erickson also has some good notes on using local and global rebuilds for search trees: https://jeffe.cs.illinois.edu/teaching/algorithms/notes/10-s...

It's intuitively clear that worst-case performance [1] can degrade to a function of the number of insertions since a rebuild (as opposed to the number of currently live items) when you use relaxed deletion. If you can afford to globally rebuild the tree you get to play the usual amortization tricks, but global rebuilds can be problematic in a large-scale live database; one solution is to rebuild from a snapshot in the background (on another thread or machine) and then replay late arriving operations before switching over to the rebuilt tree. But if someone is okay with the amortization spikes from rehashing a hash table, they should be okay with non-incremental global rebuilds for search trees; a hash table where deletions leave behind tombstones and the table is rehashed when the ratio of tombstones gets too high is basically the same idea.
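
To make the hash-table analogy concrete, here is a toy Rust sketch of relaxed deletion plus a non-incremental global rebuild, wrapping an in-memory BTreeMap rather than a real B+ tree and using an arbitrary 50% dead-entry threshold:

    use std::collections::BTreeMap;

    /// Ordered map with relaxed deletion: remove() only tombstones the entry,
    /// and the whole map is rebuilt (dropping tombstones) once dead entries
    /// exceed half of all entries, analogous to rehashing a hash table whose
    /// tombstone ratio has grown too high.
    struct RelaxedMap<K: Ord, V> {
        map: BTreeMap<K, Option<V>>, // None marks a tombstone
        dead: usize,
    }

    impl<K: Ord, V> RelaxedMap<K, V> {
        fn new() -> Self {
            RelaxedMap { map: BTreeMap::new(), dead: 0 }
        }

        fn insert(&mut self, k: K, v: V) {
            if let Some(None) = self.map.insert(k, Some(v)) {
                self.dead -= 1; // re-inserting over a tombstone revives the key
            }
        }

        fn get(&self, k: &K) -> Option<&V> {
            self.map.get(k).and_then(|slot| slot.as_ref()) // skip tombstones
        }

        fn remove(&mut self, k: &K) {
            if let Some(slot) = self.map.get_mut(k) {
                if slot.is_some() {
                    *slot = None; // tombstone instead of structural deletion
                    self.dead += 1;
                }
            }
            if self.dead * 2 > self.map.len() {
                self.rebuild(); // amortized, non-incremental global rebuild
            }
        }

        fn rebuild(&mut self) {
            let old = std::mem::take(&mut self.map);
            // Drop tombstones; live entries carry over into a fresh tree.
            self.map = old.into_iter().filter(|(_, v)| v.is_some()).collect();
            self.dead = 0;
        }
    }
A real index would do the same thing at page granularity, and could perform the rebuild from a snapshot in the background as described above.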

