Open-Meteo has been offering free weather APIs for a while now. Archiving data was not an option, because forecast data alone required 300 GB of storage.
In the past couple of weeks, I started to look for fast and efficient compression algorithms like zstd, brotli or lz4. All of them performed rather poorly with time-series weather data.
After a lot of trial and error, I found a couple of pre-processing steps that improve the compression ratio a lot:
1) Scaling data to reasonable values. Temperature has an accuracy of 0.1° at best, so I simply round everything to 0.05 instead of keeping the highest possible floating-point precision.
2) A temperature time series increases and decreases by small values: 0.4° warmer, then 0.2° colder. Storing only the deltas improves compression performance (a small sketch follows this list).
3) Data are highly spatially correlated. If the temperature is rising in one grid cell, it is rising in the neighbouring grid cells as well, so I simply subtract the time series of one grid cell from the next. This step in particular yielded a large boost.
4) Although zstd performs quite well on this encoded data, other integer compression algorithms have far better compression and decompression speeds. Namely, I am using FastPFor.
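To make steps 1 and 2 concrete, here is a rough Python sketch of the quantise-then-delta idea (illustrative only, not the actual Open-Meteo code; numpy and the 0.05° step are just there for the example):

```python
import numpy as np

def preprocess(temps, step=0.05):
    # 1) Quantise: round to the nearest 0.05°, so 21.37° becomes 427 (= 21.35°).
    scaled = np.rint(np.asarray(temps) / step).astype(np.int32)
    # 2) Delta-code along time: keep the first value, then only the hour-to-hour change.
    return np.diff(scaled, prepend=0)

def restore(deltas, step=0.05):
    # Cumulative sum undoes the delta coding, multiplying by the step undoes the scaling.
    return np.cumsum(deltas) * step

temps = [21.4, 21.6, 21.7, 21.5, 21.3]
deltas = preprocess(temps)   # -> [428, 4, 2, -4, -4], mostly tiny integers
assert np.allclose(restore(deltas), temps)
```

After this, everything is small signed integers, which is exactly what the integer codecs in step 4 are designed for.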
With this compression approach, an archive became possible. One week of weather forecast data should be around 10 GB compressed, so I can easily maintain a very long archive.
Amazing that you were able to get the data from 300 GB down to 10 GB, impressive!
Radar data: I find the most obvious gaps in predicting short-term weather are related to radar data. Obviously radar datasets would require massive storage space, but I'm curious whether you have run across any free sources for archival radar data or APIs for real-time streams, or open-source code for scraping existing services' radar feeds.
`300 GB` to `10 GB` was a bit over-optimistic ;-) The 300 GB already included 3 weeks of data. `100 GB` to `10 GB` is a more realistic number.
Many weather variables like precipitation or pressure are very easy to compress. Variables like solar radiation are more dynamic and therefore less efficient to compress.
Getting radar data is horrible... In some countries like the US or Germany it is easy, but many other countries do not offer open-data radar access. For the time being, I will integrate more open datasets first.
I wonder if you could achieve an even higher ratio with bitpacking, considering each temp has 3 digits and a range of roughly -51.2 to 51.2: with a reasonable range, 1 bit for the sign and 9 bits for the value, you could store 3 temps in an integer. Deltas might consume less range, though they might need some extra bit tweaking. AFAIK FastPFor also does something similar with SIMD, but from what I understand time is not your main concern.
Edit: just read the 0.4/0.2 delta example. If I understand right, putting only 1 real temp and then deltas could fit 12 values in the first int and 16 temps in each following int? Quick napkin math, could be wrong :)
Yes, it is a combination of delta coding, zigzag, bitpacking and outlier detection.
It only works well for integer compression. For text-based data, the results are not useful.
SIMD and decompression speed are important aspects. All forecast APIs use the compressed files as well. Previously I was using mmap'ed float16 grids, which were faster, but took significantly more space.
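To illustrate the zig-zag step, here is a tiny sketch of the generic technique (not my actual code): small signed deltas become small unsigned integers, which then need only a few bits each when bitpacked.

```python
def zigzag_encode(n: int) -> int:
    # Interleave signed values into unsigned: 0->0, -1->1, 1->2, -2->3, 2->4, ...
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def zigzag_decode(z: int) -> int:
    # Inverse mapping back to signed integers.
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)

deltas = [4, 2, -4, -4]                        # signed deltas from the time series
encoded = [zigzag_encode(d) for d in deltas]   # [8, 4, 7, 7]
bits = max(e.bit_length() for e in encoded)    # 4 bits per value are enough here
assert [zigzag_decode(e) for e in encoded] == deltas
```

A single large delta would blow up the bit width for a whole block, which is where the outlier handling comes in: patched schemes like FastPFor store such exceptions separately instead of widening every value.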
Would flac work for compression? Given that the weather data is a time series of numbers, it could be represented as audio. It would then automatically do the difference-encoding thing you're doing.
If you encoded nearby grid cells as audio channels, flac would even handle the correlation like it does for stereo audio.
> 3) Data are highly spatially correlated. If the temperature is rising in one grid cell, it is rising in the neighbouring grid cells as well, so I simply subtract the time series of one grid cell from the next. This step in particular yielded a large boost.
Sure. I bundle a small rectangle of neighbouring locations like 5x5 (= 25 locations). The actual weather model may have a grid like 2878x1441 cells (4 million).
Inside the 5x5 chunk, I subtract the center grid cell's time series from all the other grid cells. Those cells then contain only the difference to the center grid cell.
Because the values of neighbouring grid cells are similar, the resulting deltas are very small and compress much better.
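A rough numpy sketch of that chunk-wise spatial delta (the shapes and random numbers are made up for illustration, not the real file layout):

```python
import numpy as np

rng = np.random.default_rng(0)
base = 20 + np.cumsum(rng.normal(0, 0.2, 48))     # one shared "weather" signal over 48 hours
chunk = base + rng.normal(0, 0.1, (5, 5, 48))     # 25 neighbouring cells, all very similar

center = chunk[2, 2].copy()                       # the center grid cell's real time series
spatial_deltas = chunk - center                   # every cell minus the center series
# The center cell becomes all zeros and the 24 surrounding cells hold tiny
# differences, so the later quantisation and delta steps see much smaller numbers.
restored = spatial_deltas + center                # adding the center back is lossless
assert np.allclose(restored, chunk)
```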
(1) Maybe it's just me, but the "current jobs" are only available in German. If you switch to English, Spanish or French, the page gets translated, but the three "current jobs" drop-down lists get removed; super confusing, since it gets reset to German if you click "current jobs" from any of the other pages;
(2) HN is an English site; it would be nice if you linked to the English page, not the German one;
(3) if you're affiliated with the company, which I believe you are, you should say so, and noting it in your profile with contact information would be nice too.
(4) Reminder that HN has free job postings every month if you are affiliated with the company:
I have not. It looks promising, as it seems to offer multi-dimensional data storage and some compression features.
I must also admit that I like my simple approach of just keeping data in compressed local files. With fast SSDs it is super easy to scale and fault-tolerant. Nodes can just `rsync` data to stay up to date.
In the past I used InfluxDB, TimescaleDB and ClickHouseDB. They also offer good solutions for time-series data, but add a lot of maintenance overhead.
Thanks for your response. I have no affiliation, it just piqued my interest.
> I must also admit that I like my simple approach of just keeping data in compressed local files. With fast SSDs it is super easy to scale and fault-tolerant. Nodes can just `rsync` data to stay up to date.