The GitHub code search looks like an impressive piece of work (congrats!).
That said, I'm curious about the nuances regarding corpus size. Their blog post says they have 115 TB of source code, but that a positional index is "too expensive". A positional index is a 3.5x blow-up, which comes to ~400 TB of data. A 1 TB SSD retails for $50, so that's roughly $20,000 to store a positional index. That much GCP local SSD is also on the order of $25k/year. Even if you factor in replication and redundancy, the resource cost is far less than hiring a software engineer. I guess they think machines with local SSD are too much overhead to manage?
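For anyone who wants to check the arithmetic, here's the back-of-envelope version. The 3.5x blow-up factor and the $50/TB retail SSD price are the assumptions from above, not official numbers:

```python
# Back-of-envelope estimate for a positional index over GitHub's corpus.
# Assumptions (from the comment above, not from GitHub's blog post):
#   - positional index is a 3.5x blow-up over the raw corpus
#   - retail SSD pricing of ~$50 per TB
corpus_tb = 115          # source corpus size, per GitHub's blog post
index_blowup = 3.5       # assumed index-to-corpus size ratio
price_per_tb = 50        # assumed retail SSD price, USD per TB

index_tb = corpus_tb * index_blowup       # ~400 TB of index data
retail_cost = index_tb * price_per_tb     # ~$20,000 in retail SSDs

print(f"index size: ~{index_tb:.0f} TB")
print(f"retail SSD cost: ~${retail_cost:,.0f}")
```

Replication would multiply that by a small constant, but it stays well under the cost of an engineer either way.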