Not to mention they have to store the data after they download it. In theory, storing garbage data is costly to them. However, I have a nagging feeling that the attitude of these scrapers is that they get paid the same amount per gigabyte whether it's nonsense or not.
If they even are AI crawlers. It could just as well be exploit scanners searching for endpoints they'd try to exploit. That wouldn't require storing the content, only the links.
If you look at which pages are hit, and how many pages are hit by any one address in a given period of time, it's pretty easy to identify features that are reliable proxies for e.g. exploit scanners, trawlers, and agents. I publish a feed of what's being hit on my servers; contact me for details (you need to be able to make DNS queries to a particular server, directed at a domain which is not reachable from ICANN's root). A rough sketch of that kind of per-address feature extraction is below.
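For anyone curious what that per-address analysis might look like in practice, here's a minimal Python sketch against a combined-format access log. The file name, bait paths, and thresholds are illustrative assumptions, not what the parent comment actually uses:

```python
# Sketch: group requests by client address, then use simple features
# (total volume, distinct paths, hits on scanner-bait URLs) as rough
# proxies for "exploit scanner", "trawler", or ordinary client.
# Assumes an Apache/nginx combined-format log at access.log.
import re
from collections import defaultdict

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST|HEAD) (\S+)[^"]*"')

# Paths that are almost exclusively requested by exploit scanners
# (hypothetical examples, adjust for your own site).
SCANNER_PATHS = ("/wp-login.php", "/.env", "/phpmyadmin", "/xmlrpc.php")

def classify(logfile="access.log", bulk_threshold=500):
    hits = defaultdict(int)          # total requests per address
    paths = defaultdict(set)         # distinct paths per address
    scanner_hits = defaultdict(int)  # requests for scanner-bait paths

    with open(logfile) as f:
        for line in f:
            m = LOG_LINE.match(line)
            if not m:
                continue
            addr, _ts, path = m.groups()
            hits[addr] += 1
            paths[addr].add(path)
            if any(path.startswith(p) for p in SCANNER_PATHS):
                scanner_hits[addr] += 1

    for addr in sorted(hits, key=hits.get, reverse=True):
        if scanner_hits[addr] > 0:
            label = "exploit scanner"
        elif hits[addr] > bulk_threshold and len(paths[addr]) > 0.9 * hits[addr]:
            label = "trawler (bulk fetch, rarely repeats a URL)"
        else:
            label = "agent / ordinary client"
        print(f"{addr}\t{hits[addr]} hits\t{label}")

if __name__ == "__main__":
    classify()
```

Obviously real traffic needs more than three buckets, but even crude features like these separate the "fetch everything once" crawlers from the "probe the same handful of admin URLs everywhere" scanners pretty cleanly.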