Open-Meteo has been offering free weather APIs for a while now. Archiving data was not an option, because forecast data alone required 300 GB of storage.
In the past couple of weeks, I started to look for fast and efficient compression algorithms like zstd, brotli or lz4. All of them performed rather poorly with time-series weather data.
After a lot of trial and error, I found a couple of pre-processing steps that improve the compression ratio a lot:
1) Scaling data to reasonable values. Temperature has an accuracy of 0.1° at best, so I simply round everything to 0.05 instead of keeping the highest possible floating-point precision.
2) A temperature time series increases and decreases by small values: 0.4° warmer, then 0.2° colder. Storing only the deltas improves compression performance (a small sketch follows this list).
3) Data are highly spatially correlated. If the temperature is rising in one grid cell, it is rising in the neighbouring grid cells as well, so I simply subtract the time series of one grid cell from the next. This step in particular yielded a large boost.
4) Although zstd performs quite well on this encoded data, other integer compression algorithms have far better compression and decompression speeds. Namely, I am using FastPFor.
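To make steps 1 and 2 concrete, here is a rough Python sketch of the quantise-then-delta idea (illustrative only, not the actual Open-Meteo code; numpy and the 0.05° step are just there for the example):

```python
import numpy as np

def preprocess(temps, step=0.05):
    # 1) Quantise: round to the nearest 0.05°, so 21.37° becomes 427 (= 21.35°).
    scaled = np.rint(np.asarray(temps) / step).astype(np.int32)
    # 2) Delta-code along time: keep the first value, then only the hour-to-hour change.
    return np.diff(scaled, prepend=0)

def restore(deltas, step=0.05):
    # Cumulative sum undoes the delta coding, multiplying by the step undoes the scaling.
    return np.cumsum(deltas) * step

temps = [21.4, 21.6, 21.7, 21.5, 21.3]
deltas = preprocess(temps)   # -> [428, 4, 2, -4, -4], mostly tiny integers
assert np.allclose(restore(deltas), temps)
```

After this, everything is small signed integers, which is exactly what the integer codecs in step 4 are designed for.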
With this compression approach, an archive became possible. One week of weather forecast data should be around 10 GB compressed, so I can easily maintain a very long archive.
Amazing that you were able to get the data from 300 GB down to 10 GB, impressive!
Radar data: I find the most obvious gaps in predicting short-term weather are related to radar data. Obviously radar datasets would require massive storage space, but I'm curious whether you have run across any free sources for archival radar data or APIs for real-time streams, or open-source code for scraping existing services' radar feeds.
`300 GB` to `10 GB` was a bit over-optimistic ;-) The 300 GB already included 3 weeks of data. `100 GB` to `10 GB` is a more realistic number.
Many weather variables like precipitation or pressure are very easy to compress. Variables like solar radiation are more dynamic and therefore less efficient to compress.
Getting radar data is horrible... In some countries like the US or Germany it is easy, but many other countries do not offer open-data radar access. For the time being, I will integrate more open datasets first.
I wonder if you could achieve an even higher ratio with bitpacking, considering each temp has 3 digits and a range of roughly -51.2 to 51.2: with a reasonable range, 1 bit for the sign and 9 bits for the value, you could store 3 temps in an integer. Deltas might consume less range, though they might need some extra bit tweaking. AFAIK FastPFor also does something similar with SIMD, but from what I understand time is not your main concern.
Edit: just read the 0.4/0.2 delta example. If I understand right, putting only 1 real temp and then deltas could fit 12 values in the first int and 16 temps in each following int? Quick napkin math, could be wrong :)
Yes, it is a combination of delta coding, zigzag, bitpacking and outlier detection.
It only works well for integer compression. For text-based data, the results are not useful.
SIMD and decompression speed are important aspects. All forecast APIs use the compressed files as well. Previously I was using mmap'ed float16 grids, which were faster, but took significantly more space.
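To illustrate the zig-zag step, here is a tiny sketch of the generic technique (not my actual code): small signed deltas become small unsigned integers, which then need only a few bits each when bitpacked.

```python
def zigzag_encode(n: int) -> int:
    # Interleave signed values into unsigned: 0->0, -1->1, 1->2, -2->3, 2->4, ...
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def zigzag_decode(z: int) -> int:
    # Inverse mapping back to signed integers.
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)

deltas = [4, 2, -4, -4]                        # signed deltas from the time series
encoded = [zigzag_encode(d) for d in deltas]   # [8, 4, 7, 7]
bits = max(e.bit_length() for e in encoded)    # 4 bits per value are enough here
assert [zigzag_decode(e) for e in encoded] == deltas
```

A single large delta would blow up the bit width for a whole block, which is where the outlier handling comes in: patched schemes like FastPFor store such exceptions separately instead of widening every value.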
Would flac work for compression? Given that the weather data is a time series of numbers, it could be represented as audio. It would then automatically do the difference-encoding thing you're doing.
If you encoded nearby grid cells as audio channels, flac would even handle the correlation like it does for stereo audio.
> 3) Data are highly spatially correlated. If the temperature is rising in one grid cell, it is rising in the neighbouring grid cells as well, so I simply subtract the time series of one grid cell from the next. This step in particular yielded a large boost.
Sure. I bundle a small rectangle of neighbouring locations like 5x5 (= 25 locations). The actual weather model may have a grid like 2878x1441 cells (4 million).
Inside the 5x5 chunk, I subtract the center grid cell's time series from all the other grid cells. Those cells then contain only the difference to the center grid cell.
Because the values of neighbouring grid cells are similar, the resulting deltas are very small and compress much better.
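A rough numpy sketch of that chunk-wise spatial delta (the shapes and random numbers are made up for illustration, not the real file layout):

```python
import numpy as np

rng = np.random.default_rng(0)
base = 20 + np.cumsum(rng.normal(0, 0.2, 48))     # one shared "weather" signal over 48 hours
chunk = base + rng.normal(0, 0.1, (5, 5, 48))     # 25 neighbouring cells, all very similar

center = chunk[2, 2].copy()                       # the center grid cell's real time series
spatial_deltas = chunk - center                   # every cell minus the center series
# The center cell becomes all zeros and the 24 surrounding cells hold tiny
# differences, so the later quantisation and delta steps see much smaller numbers.
restored = spatial_deltas + center                # adding the center back is lossless
assert np.allclose(restored, chunk)
```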
(1) Maybe it's just me, but the "current jobs" are only available in German. If you switch to English, Spanish or French, the page gets translated, but the three "current jobs" drop-down lists get removed; super confusing, since it gets reset to German if you click "current jobs" from any of the other pages;
(2) HN is an English site; it would be nice if you linked to the English page, not the German one;
(3) if you're affiliated with the company, which I believe you are, you should say so, and noting it in your profile with contact information would be nice too.
(4) Reminder that HN has free job postings every month if you are affiliated with the company:
I have not. It looks promising, as it seems to offer multi-dimensional data storage and some compression features.
I must also admit that I like my simple approach of just keeping data in compressed local files. With fast SSDs it is super easy to scale and fault-tolerant. Nodes can just `rsync` data to stay up to date.
In the past I used InfluxDB, TimescaleDB and ClickHouseDB. They also offer good solutions for time-series data, but add a lot of maintenance overhead.
Thanks for your response. I have no affiliation, it just piqued my interest.
> I must also admit that I like my simple approach of just keeping data in compressed local files. With fast SSDs it is super easy to scale and fault-tolerant. Nodes can just `rsync` data to stay up to date.