Historical weather data API for machine learning, free for non-commercial (openmeteo.substack.com)
196 points by meteo-jeff on July 6, 2022 | 45 comments


Some technical background:

Open-Meteo has offered free weather APIs for a while now. Archiving data was not an option, because forecast data alone required 300 GB of storage.

In the past couple of weeks, I started to look for fast and efficient compression algorithms like zstd, brotli or lz4. All of them performed rather poorly with time-series weather data.

After a lot of trial and error, I found a couple of pre-processing steps that improve the compression ratio a lot:

1) Scaling data to reasonable values. Temperature has an accuracy of 0.1° at best. I simply round everything to 0.05 instead of keeping the highest possible floating point precision.

2) A temperature time-series increases and decreases by small values. 0.4° warmer, then 0.2° colder. Only storing deltas improves compression performance.

3) Data are highly spatially correlated. If the temperature is rising in one "grid-cell", it is rising in the neighbouring grid-cells as well. Simply subtract the time-series of one grid-cell from that of the next grid-cell. This step in particular yielded a large boost.

4) Although zstd performs quite well with this encoded data, other integer compression algorithms have far better compression and decompression speeds. Namely, I am using FastPFor.

With that compression approach, an archive became possible. One week of weather forecast data should be around 10 GB compressed. With that, I can easily maintain a very long archive.
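
To make steps 1 and 2 concrete, here is a minimal sketch in Python (the 0.05° quantization follows the description above; the function and variable names are just illustrative, not the actual Open-Meteo code):

    import numpy as np

    def quantize_and_delta(temps_celsius):
        # Step 1: round to 0.05 degree steps and store as small integers.
        scaled = np.round(np.asarray(temps_celsius) / 0.05).astype(np.int32)
        # Step 2: keep the first value, then store only hour-to-hour deltas;
        # consecutive values differ by only a few counts.
        return np.diff(scaled, prepend=0)

    # 20.0, 20.4, 20.2 degC -> [400, 8, -4]
    print(quantize_and_delta([20.0, 20.4, 20.2]))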


Amazing you were able to get data from 300 GB to 10 GB, impressive!

Radar data: I find the most obvious gaps in predicting short-term weather are related to radar data. Obviously radar datasets would require massive storage space, but I am curious if you have run across any free sources for archival radar data or APIs for real-time streams, or open-source code for scraping existing services' radar feeds.


`300 GB` to `10 GB` was a bit over-optimistic ;-) 300 GB already included 3 weeks of data. `100 GB` to `10 GB` is a more realistic number.

Many weather variables like precipitation or pressure are very easy to compress. Variables like solar radiation are more dynamic and therefore less efficient to compress.

Getting radar data is horrible... In some countries like the US or Germany, it is easy, but many other countries do not offer open-data radar access. For the time being, I will integrate more open datasets first.


I wonder if with bit-packing you could achieve an even higher ratio: considering each temp has 3 digits and a range of -51.2 to 51.2 is reasonable, with 1 sign bit and 9 temperature bits you could store 3 temps in an integer. Deltas might consume less range, but might need some extra bit tweaking. AFAIK FastPFor also does a similar run with SIMD, but from what I understand, time is not your main concern.

Edit: just read the 0.04-0.02 range. If I understand right, putting only 1 real temp and then deltas could fit 12 in the first int and 16 temps in the following ints? Quick napkin math, could be wrong :)


Yes, it is a combination of delta coding, zigzag, bitpacking and outlier detection.

It only works well for integer compression. For text-based data, the results are not useful.

SIMD and decompression speed are important aspects. All forecast APIs use the compressed files as well. Previously I was using mmap'ed float16 grids, which were faster, but took significantly more space.
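
For anyone unfamiliar with zigzag encoding: it maps signed deltas onto small unsigned integers so they bit-pack well. A textbook sketch, assuming values fit in 32 bits (this illustrates the general technique, not necessarily the exact code used here):

    def zigzag_encode(v: int) -> int:
        # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
        return (v << 1) ^ (v >> 31)

    def zigzag_decode(u: int) -> int:
        return (u >> 1) ^ -(u & 1)

    print([zigzag_encode(v) for v in [0, -1, 1, -2, 2]])   # [0, 1, 2, 3, 4]
    print([zigzag_decode(u) for u in [0, 1, 2, 3, 4]])     # [0, -1, 1, -2, 2]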


Would flac work for compression? Given the weather data is a time series of numbers it could be represented as audio. It would then automatically do the difference encoding thing you’re doing.

If you encoded nearby grid cells as audio channels, flac would even handle the correlation like it does for stereo audio.


> 3) Data are highly spatially correlated. If the temperature is rising in one "grid-cell", it is rising in the neighbouring grid-cells as well. Simply subtract the time-series of one grid-cell from that of the next grid-cell. This step in particular yielded a large boost.

Can you expand on this?


Sure. I bundle a small rectangle of neighbouring locations like 5x5 (= 25 locations). The actual weather model may have a grid like 2878x1441 cells (4 million).

Inside the 5x5 chunk, I subtract all grid-cells from the center grid-cell. The borders will then contain only the difference to the center grid-cell.

Because the values of neighbouring grid-cells are similar, the resulting deltas are very small and better compressible.
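
A toy example of that spatial differencing in Python, assuming the values are already quantized integers (the chunk shape follows the 5x5 description above; the numbers are made up for illustration):

    import numpy as np

    def spatial_delta(chunk):
        # Subtract the center cell from every cell of the 5x5 chunk; the
        # center keeps its absolute value, all other cells become small
        # residuals relative to it.
        center = chunk[2, 2]
        out = chunk - center
        out[2, 2] = center
        return out

    chunk = np.array([[400, 401, 401, 402, 402],
                      [400, 401, 402, 402, 403],
                      [401, 401, 402, 403, 403],
                      [401, 402, 402, 403, 404],
                      [402, 402, 403, 404, 404]])
    print(spatial_delta(chunk))   # mostly -2..2, plus 402 at the center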


Hello, good work.

Please have a look in case you would like to work for this company: https://www.energymeteo.de/ueber_uns/jobs.php


Feedback:

(1) Maybe it’s just me, but the “current jobs” are only available in German; if you switch to English, Spanish or French, the page gets translated, but the three “current jobs” drop-down lists get removed. Super confusing, since it gets reset to German if you click “current jobs” from any of the other pages;

(2) HN is an English-language site; it would be nice if you linked to the English page, not the German one;

(3) If you’re affiliated with the company, which I believe you are, you should say so; noting it in your profile with contact information would be nice too.

(4) Reminder that HN has free job postings every month if you are affiliated with the company:

https://news.ycombinator.com/submitted?id=whoishiring


Thank you for your feedback.


[flagged]


I think these are reasonable suggestions from a perspective of someone who wants this site to become better.


Interesting. Have you come across TileDB before?

https://tiledb.com/


I have not. It looks promising, as it seems to offer multi-dimensional data storage and some compression aspects.

I must also admit that I like my simple approach of just keeping data in compressed local files. With fast SSDs, it is super easy to scale and fault-tolerant. Nodes can just `rsync` data to keep up to date.

In the past I used InfluxDB, TimescaleDB and ClickHouseDB. They also offer good solutions for time-series data, but add a lot of maintenance overhead.


Thanks for your response. I have no affiliation, it just piqued my interest.

> I must also admit that I like my simple approach of just keeping data in compressed local files. With fast SSDs, it is super easy to scale and fault-tolerant. Nodes can just `rsync` data to keep up to date.

I also like this approach :-)


This is how we could defeat a rogue AI: Distract it by talking about the weather.


It would also be fun to have the historical weather *forecasts* so that you can compare the forecasts with the eventually measured data.


Actually, they are historical weather forecasts, but assembled into a continuous time-series.

Storing each weather forecast individually, to enable a performance evaluation of "how good is a forecast 5 days out", would require a lot of storage. Some local weather models update every 6 hours.

But even with a continuous time-series, you can already tell how good or bad a forecast is compared to measurements. Assuming your measurements are correct ;-)


This would still be a remarkable dataset for learning. And worth the storage. Though it might need other inputs as well (like pressure zone etc.) to escape potential biases.


Hi meteo-jeff, this looks really cool!

I have two questions:

1) How does the spatial resolution come into this? Is it constant data all across the 2kmx2km (?) parcel with an abrupt change, or is it interpolated in some way? Can I query the coordinates of the mesh?

2) How 'historical' does it get? How far back can I go with this?

Thank you!


Hi bernulli,

1) Data are coming from multiple weather models. The primary data source is the German weather service DWD with the ICON weather model. In my past experience, the DWD ICON model performs best for many regions. DWD ICON has a global (~13 km), a European (7 km) and a Central European (1-2 km) "domain". A higher resolution can improve forecast accuracy, but this is not guaranteed.

For Open-Meteo APIs, multiple models are mixed together. Typically, high-resolution domains only provide 3-5 days of forecast; afterwards they are combined with a global model.

For North American locations, I am going to add high-resolution domains from NOAA as well.

2) For now, only a couple of months of archive are available. There will be no limit on how much data can be stored. Data is fairly well compressed while still maintaining good read performance.

I am working on a long-term archive as well. ECMWF provides a reanalysis dataset called ERA5 [1] with data from 1959. It will still take me a couple of weeks to process it. With 23 weather variables, it requires around 20 TB of disk space (gridded float32 with deflate compression).

[1] https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...


Thank you!

I don’t think I understand the resolution then; could you explain a bit more? Say I request data along a 100 km line, every 10 m. Do I get the same numbers if it’s in the same mesh cell, with a sudden change when it’s crossed, or do I get some (bilinear?) interpolation?


The tool itself seems to provide up to 3 months of history:

https://open-meteo.com/en/docs#latitude=52.52&longitude=13.4...


Original open-meteo HN thread for background (9 months back): https://news.ycombinator.com/item?id=28499910


Does not seem accurate. This is telling me it snowed 1.33cm on June 17, 2022 in New York City.

https://api.open-meteo.com/v1/forecast?latitude=40.71&longit...


Thanks for the info. Snowfall was recently added. I am afraid there could be a bug with that particular variable.

Temperature, clouds, etc, seem fine

EDIT: Issue identified and will be fixed in the next few days! Thanks!


Still, trying to predict weather using historical data is like trying to predict the next number on a roulette wheel using historical numbers.


I think historical data isn’t necessarily applied to forecasting the weather. For example, my first thought with this data is to comb through it and build a model where, when X conditions exist, Y airport delays are likely. The FAA doesn’t provide the data for their end of the model, though.


If yesterday the weather was clear skies and the temperature followed a given curve, then today, if the weather is clear skies, the temperature is going to follow a very similar curve. Same rough curve if it was clear skies a year ago. The exact values might be a bit different, but the high for the day will probably be about the same amount over the sunrise temperature in all 3 of these scenarios. Throw in wind direction, and you could be more accurate with this.

This is why the 14th Weather Squadron creates Wind Stratified Conditional Climatology tables. Past performance is indicative of future results, especially when you're not under the influence of a frontal system.


it's really not


Thanks for offering this service!

You explain that your API offers historical data using the "past_days" parameter. Could you also offer a "date" parameter for a given day, or are you only keeping a rolling window of data?


Sure. What do you think about "&start_date=20220701" and "&end_date=20220714"?

If end_date is not specified, it would return start_date with a 7-day forecast.
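
For illustration, a request with the proposed parameters could look roughly like this. Purely hypothetical, since start_date/end_date do not exist yet, and the hourly variable name here is only an example:

    import requests

    # Hypothetical usage of the proposed start_date/end_date parameters.
    params = {
        "latitude": 52.52,
        "longitude": 13.41,
        "hourly": "temperature_2m",
        "start_date": "20220701",
        "end_date": "20220714",
    }
    r = requests.get("https://api.open-meteo.com/v1/forecast", params=params)
    print(list(r.json().keys()))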


Check out https://docs.aiqc.io for easy walk-forward, multivariate deep learning: https://docs.aiqc.io/notebooks/gallery/tensorflow/tab_foreca...

Excited to play with some of this data.


Do you have a commercial option? Does anyone know good alternatives?

What forecast models do you use for Australia?


Hi farmin,

So far I do not offer commercial options, just to keep me out of any potential legal issues. In the next few weeks I will review everything, make sure attributions and licenses are correct, and remove the non-commercial limitation.

Australia is currently only covered by a global weather model from the German weather service DWD. I will check if the BOM offers some open-data models.


Thanks for your reply. I think BoM do offer some stuff. But check out SILO for easy access to historical data. https://www.longpaddock.qld.gov.au/silo/


BoM means $$ even for the last 72hrs


I know GFS is a global-scale model as well. If you're already going to be integrating the other NOAA models, you might as well do GFS too?


I think the issue is probably that NOAA data is gathered via a bunch of agreements with other countries, and at least some of them stipulate non-commercial use. Unsure about how vigorously that’s actually enforced.

If you’re curious to read the license text: https://gourdian.net/g/eric/noaa_gsod.global_summary_of_day#...


Looks like "please contact us" (https://open-meteo.com/en/features#terms). So you're in the right place? :)


You can get these for free from the government websites.


Are there any governments publishing worldwide data over an API? Couldn't find any last time I checked.


Not as an API, but for ML purposes static data (with a quick cleaning script) is perfectly usable. The amount of storage needed for weather data isn't worth spending money on an API. This isn't like financial data where storage could potentially become a problem for the casual user.


Did you train on weather data? And if so, how accurate were/are your forecasts?




