That (very interesting) pvk.ca blog post seems consistent with the schani.wordpress.com one. In particular, while schani mentions preferring linear search and pvk mentions preferring binary search, those preferences are for slightly different cases. schani says:
> If you’re very, very serious about performance and know the array size statically, completely unroll the binary search.
and pvk says:
> I’ll focus on one specific case that is of interest to me: searching a short or medium-length sorted vector of known size.
"Known" here means "known at compile time". And of course he concludes:
> In general, I’d just stick to binary search: it’s easy to get good and consistent performance from a simple, portable implementation.
Both bloggers use conditional moves to avoid conditional branches; use unrolled versions; and of course use benchmarks to arrive at their conclusions.
I had two points in that post. The first, obvious, one is that binary search can be micro-optimised to combine decent algorithmic properties with an enviable constant factor.
The second one is that, when linear search outperforms binary search, it does so by breaking out of the search, which needs conditional branches. Binary search, despite its bad reputation, has conditional branches that are easily converted to conditional moves or masks; even a loopy implementation is amenable to trip count prediction (it's a function of the log of the size of the array). If we must avoid mispredicted branches, binary search is intrinsically a better option than linear search.
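For concreteness, here's a rough sketch (mine, not taken from either post) of the kind of branch-free binary search both bloggers describe: the comparison result only steers where the next probe lands, and that update is the kind compilers commonly lower to a conditional move, so there's no hard-to-predict branch in the loop body.

    #include <stddef.h>

    /* Branch-free lower bound sketch: returns the index of the first
     * element >= key in a sorted array. The number of iterations depends
     * only on n, and the data-dependent update below is cmov-friendly,
     * so nothing in the loop mispredicts on the comparison result. */
    size_t lower_bound(const int *arr, size_t n, int key)
    {
        const int *base = arr;
        size_t len = n;
        while (len > 1) {
            size_t half = len / 2;
            base = (base[half - 1] < key) ? base + half : base;  /* cmov */
            len -= half;
        }
        return (size_t)(base - arr) + (len == 1 && base[0] < key);
    }

With the array size known at compile time the trip count is a constant, so the loop can be fully unrolled, which is the variant schani is talking about.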
Modern architectures really necessitate a good understanding of the instruction pipeline and caches to squeeze out the best performance.
If I remember correctly, Python's hashtables are initialized with 8 buckets that are linearly searched and then switched to a real hashtable implementation when grown past that size.
I have worked with the L4 microkernel where sooo much emphasis was put on keeping instruction and data footprints as small as possible every time the kernel is entered in order not to dirty i- and d-caches.
And I have also seen game engine developers do amazing things in this regard. An interesting development in the gaming space is data-oriented design, which deviates from OOP for, among other things, performance and parallelization. See http://www.slideshare.net/mobile/cellperformance/data-orient... (though I don't agree with the three "lies" mentioned, I do like the data-centric approach).
True, if you're out of cache, there's a goldilocks zone where linear search works better. I mostly disregard that because *chotomic search is equivalent/quicker in cache and for small arrays, and quicker for large arrays.
>> ...use linear search if your array is below around 64 elements in size, binary search if it’s above.
Back around 1980 when I first looked into this the generally accepted cutoff was 25 rather than 64. I don't think people were unrolling loops in the tests back then so it's hard to tell whether loop unrolling or changes in CPU architecture is the greater factor in moving the cutoff point from 25 to 64.
This is a bit of a contrived example, because if you're just searching for an integer in a large array, memory bandwidth is going to be the bottleneck.
When dealing with small arrays (like the example of 0-250 integers, i.e. less than 1 kilobyte), the figures are around 20-120 nanoseconds. In other words, the difference between the best and the worst case is around one main memory reference (a "cache miss").
While this is somewhat interesting, the lesson to take home should be that most of the time, smart memory usage is more important than what the CPU executes. Warming up the caches (e.g. __builtin_prefetch) is as efficient an optimization as rewriting and unrolling the entire loop and should be done first. If the caches are warm and the search is still the bottleneck, then it's time to consider the other optimizations.
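To make that concrete, here's a hedged sketch (GCC/Clang builtins; the function name and the choice of which addresses to prefetch are mine, purely illustrative) of warming up the cache ahead of a search:

    #include <stddef.h>

    /* Pull the first few probe points of an upcoming binary search into
     * cache while unrelated work is still in flight, so the eventual
     * search mostly hits warm lines. */
    static inline void prefetch_search_path(const int *arr, size_t n)
    {
        size_t mid = n / 2;
        __builtin_prefetch(&arr[mid], 0, 3);            /* first probe         */
        __builtin_prefetch(&arr[mid / 2], 0, 3);        /* second probe, left  */
        __builtin_prefetch(&arr[mid + mid / 2], 0, 3);  /* second probe, right */
    }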
Getting stuff out of cache is also worthwhile in some cases. I had some message passing between cores, and apparently the CPU wasn't good at telling that my chance of using a populated and sent message slot again was zero. Adding a clflush after send significantly reduced my number of cache misses.
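The idea, sketched with x86's _mm_clflush (the slot layout here is invented for illustration, not my actual code): once the payload is written and the flag published, the sender evicts the line it will never touch again.

    #include <immintrin.h>   /* _mm_clflush */
    #include <stdatomic.h>

    /* Hypothetical one-cache-line message slot. */
    struct msg_slot {
        _Atomic unsigned ready;
        unsigned payload[15];
    } __attribute__((aligned(64)));

    static void send(struct msg_slot *slot, const unsigned *data)
    {
        for (int i = 0; i < 15; i++)
            slot->payload[i] = data[i];
        atomic_store_explicit(&slot->ready, 1, memory_order_release);
        /* The sender never reads this slot again, so write the dirty
         * line back now instead of leaving it for the receiving core
         * to steal later. */
        _mm_clflush(slot);
    }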
This is an aspect of multi-core programming. As the programmer, you must be painfully aware of the cache lines you're touching and which cores touch them. Having two or more cores touch a cache line causes expensive core-to-core synchronization. The case is easier in a message passing system like yours, where the other core is clearly the sender and can flush the caches. In kernel space, you can disable caches or use write-combine or write-through caching for memory regions as needed, but this isn't really available in userspace.
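In userspace, about the only tool you have is layout: pad or align per-core data so no two cores write to the same line. A minimal sketch (the names are made up):

    #include <stdint.h>

    #define CACHE_LINE 64

    /* Per-core counter padded out to a full cache line. Without this,
     * neighbouring counters share a line and every write on one core
     * invalidates that line in the other cores' caches (false sharing). */
    struct per_core_counter {
        _Alignas(CACHE_LINE) uint64_t count;
        char pad[CACHE_LINE - sizeof(uint64_t)];
    };

    struct per_core_counter counters[8];   /* one entry per core */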
This can happen in very unintuitive places. For example, in Java, array.size() reads from the same cache line as the beginning of the array. If you're partitioning an array processing routine for many cores, you should not call array.size() to avoid very expensive synchronization.
I also think he should have covered the time it would take to sort the array. The array wouldn't need to be sorted for a linear search, but it would for a binary search. He made his cutoff at around 64 elements, but if you'd include the time binary search needs to sort the array, the cutoff would be higher than that -- and it would also introduce a lot of other complexities to consider in a production system.
I really doubt there's any data size for which sort + binary search is going to be faster than linear search. The former has higher complexity (so it's going to be slower for large inputs) and more expensive operations (so it will be slower for small inputs).
You can also think about it as a sort of reductio ad absurdum: assume the array is actually already sorted and you use a sorting algorithm with O(n) complexity for already-sorted data. In this ideal case you'd need n (for sorting) + log n (for the binary search) predictable operations. For linear search you'd only need n operations of the same type. In practice you'd need n log n operations for sorting, you probably can't avoid unpredictable branches, and instead of just loading elements you'll sometimes move them around.
I think the assumption is that you sort once and search several times (enough to amortize the cost of the sort). The linear search algorithm also had an early exit, so the array has to be sorted for that too. That makes this a fair comparison, but I agree that seeing a linear unsorted search would have been interesting too.
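A tiny sketch of that sort-once, search-many pattern with the portable qsort/bsearch pair (rather than either blogger's tuned search), just to show where the amortization comes from:

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Pay the O(n log n) sort once... */
    void prepare(int *arr, size_t n)
    {
        qsort(arr, n, sizeof *arr, cmp_int);
    }

    /* ...then every subsequent lookup is O(log n). */
    int contains(const int *arr, size_t n, int key)
    {
        return bsearch(&key, arr, n, sizeof *arr, cmp_int) != NULL;
    }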
Binary Search eliminates Branch Mispredictions http://www.pvk.ca/Blog/2012/07/03/binary-search-star-elimina...