Thanks for looking in depth at our post. The Hitachi RTU500 mention is not a hallucination; we did check for those. It is mentioned in the Mandiant threat intelligence data.
As far as I can tell, the only connection between them is that CISA released this alert, which bundles multiple unrelated advisories in one post: it happens to include the Siemens and Palo Alto advisories alongside another, unrelated Hitachi RTU500 advisory: https://www.cisa.gov/news-events/alerts/2024/04/25/cisa-rele...
Isn't the tool doing its job in that case? I wouldn't generally expect it to independently determine that an otherwise reliable source made a mistake. In fact I feel like that would be a really bad idea.
Imagine if a relatively clueless intern left something out of a report because the textbook "seemed wrong".
Saying that the input data is wrong and the AI didn't hallucinate that data is also kind of a "trust me bro" statement.
The Mandiant feed is not public, so I cannot check what was fed to it.
I don't really care why it's wrong. It is wrong. And using that as the example prompt in your announcement is an interesting choice.
Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce the issue -- send us an email at magika-dev@google.com if that is possible.
For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use cases would emerge, so it is good to know that such a model might be useful.
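In the meantime, one way a crawler can approximate this is to fetch only the first few KB with an HTTP Range request and classify just that prefix. A rough Python sketch (the 4 KB cutoff and the URL are arbitrary placeholders, and it assumes the server honors Range requests):

    # Rough sketch: fetch only the first 4 KB of a remote file so a
    # "head-only" classifier could look at the prefix without downloading
    # the whole thing. URL and cutoff are illustrative placeholders.
    import requests

    def fetch_prefix(url: str, max_bytes: int = 4096) -> bytes:
        resp = requests.get(url, headers={"Range": f"bytes=0-{max_bytes - 1}"}, timeout=10)
        resp.raise_for_status()
        # Servers that ignore Range return 200 with the full body; trim defensively.
        return resp.content[:max_bytes]

    prefix = fetch_prefix("https://example.com/some/file.bin")
    print(len(prefix), prefix[:8])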
We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that, despite our best efforts to test Magika extensively on various file types, it is not as good on font formats as it should be. We will look into it.
Thanks again for sharing your experience with Magika -- this is very useful.
Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff
These are files that were in one of my crawl datasets.
I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
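To make that concrete, here's a tiny Python sketch (purely illustrative) showing that ordinary words whose length is a multiple of four pass a strict base64 decode, so charset and length checks alone can't distinguish encoded data from plain text:

    # Illustrative only: common words whose length is a multiple of 4
    # decode as valid base64, so "is it valid base64?" is not the same
    # question as "is it base64-encoded data?".
    import base64

    for word in ["test", "word", "database", "HelloWorld42"]:
        decoded = base64.b64decode(word, validate=True)  # no exception raised
        print(f"{word!r} decodes to {decoded!r}")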
LOL nice b8 m8. For the rest of you who are curious, the files look like this:
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://placement.api.test4.example.com/" on this server.<P>
Reference #18.9cb0f748.1695037739.283e2e00
</BODY>
</HTML>
Are you attempting to assert that use of these files solely for the purpose of improving a software system meant to classify file types does not fall under fair use?
Here's another one for you: Do you believe that all pictures you have ever taken, all emails you have ever written, all code you have ever written could be posted here on this forum to improve someone else's software system?
If so, could you go ahead and post that zip? I'd like to ingest it in my model.
Your question seems orthogonal to the situation. The three files posted seem to be the minimum amount of information required to reproduce the bug. Fair use encompasses a LOT of uses of otherwise copyrighted work, and this seems clearly to be one.
I notice you're so invested that you haven't noticed that the files have been renamed and zipped such that they're not even indexable. How you'd expect anyone not participating in software development to find them is yet to be explained.
I'm over here trying to fathom the lack of control over one's own life it would take to cause someone to turn into an online copyright cop, when the data in question isn't even their own, is clearly divorced from any context which would make it useful for anything other than fixing the bug, and about which the original copyright holder hasn't complained.
Some people just want to argue.
If the copyright holder has a problem with the use, they are perfectly entitled to spend some of their dollar bills to file a law suit, as part of which the contents of the files can be entered into the public record for all to legally access, as was done with Scientology.
I'm always happy to stand up for folks who make things over people who want to police them. Especially when nothing wrong has happened. Maybe take a walk and get some fresh air?
I share your distaste for people whose only contribution is subtraction, but I suggest you lay off the sarcasm. Trolls: don't feed them. (Well done on your project, BTW)
I don't see any sarcasm from me in the thread. I had serious questions. Perhaps you could point out what you see? Thanks for the supportive words about the project.
I've certainly seen people say similar things facetiously, but I was being genuine. I'm not sure if beeboobaa was trolling or not, I try to take what folks say at face value. They seemed to be pretty attached to a particular point of view, though. Happens to all of us. The thing for attachment is time and space and new experiences. Walks are great for those things, and also the best for organizing thoughts. Einstein loved taking walks for these reasons, and me too. It feels better to suggest something helpful when discussion derails, than to hurl insults as happens all too frequently.
I already had my walk this morning, thanks! If you'd like to learn more about copyright law, including about all the ways it's fuzzy around the edges for legitimate uses like this one, I highly recommend groklaw.net. PJ did wonderful work writing about such boring topics in personable and readable ways. I hope you have a great day!
Thanks for such great opportunities to post educational content to Hacker News! I genuinely hope some things go your way, man. Rooting for you. Go get 'em.
If you can’t undermine someone’s argument, undermine their nationality. American tech culture doesn’t do this as much as it should, perhaps because we know eventually those folks wake up.
Not sure what your point is, but why would I care to learn about the laws of some other dude's country that he's using to support his bizarro arguments?
And then also:
- clusterfuzz/oss-fuzz scan .txt source files with (sandboxed) static and dynamic analysis tools,
- `debsums`/`rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they are supposed to have been installed from,
- a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good,
- ~gdesktop LLM tools scan every file,
- and there are extended filesystem attributes for label-based MAC systems like SELinux, oh and NTFS ADS.
A sufficiently strong cryptographic hash function yields output bits that are uniformly distributed.
DRBGs (Deterministic Random Bit Generators) need high-entropy random bits in order to continuously re-seed the random number generator.
Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?
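As a toy illustration of the file-based HIDS pattern above (not any particular product), a baseline-and-diff over SHA-256 hashes looks roughly like this; the directory and the hash choice are just placeholders:

    # Toy sketch of the file-based HIDS pattern: record a baseline of
    # file hashes, then flag anything changed or added on a later scan.
    # The directory and the use of SHA-256 are illustrative assumptions.
    import hashlib
    from pathlib import Path

    def hash_tree(root: str) -> dict:
        return {
            str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in Path(root).rglob("*") if p.is_file()
        }

    baseline = hash_tree("/usr/local/bin")
    # ... later scan ...
    current = hash_tree("/usr/local/bin")
    changed = sorted(p for p in current if baseline.get(p) != current[p])
    print(changed)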
> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API. ...

That works with package metadata, not with a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifests of already-calculated checksums.
Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.
Add'l useful formats:
> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories
File-based hashing is done in so many places; there's so much heat.
Sub-file hashing with feature engineering is necessary for AV, which must take packing, obfuscation, loading, and dynamic analysis into account, in addition to zip archives and magic file numbers.
AV (antivirus) applications with LLMs: what do you train them on, and what are some of the existing signature databases?
https://SigStore.dev/ (The Linux Foundation) also has a hash-file inverted index for released artifacts.
Also, otoh, with a time limit:
1. What file is this? Dirname, basename, hash(es)
2. Is it supposed to be installed at such a path?
3. Per its header, is the file an archive or an image or a document? (see the sketch after this list)
4. What files, records, and fields are packed into the file, and what transforms were applied to the data?
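For question 3, the crudest possible version is just comparing the first bytes against a few well-known magic numbers; a deliberately incomplete Python sketch (real tools like `file` and Magika obviously cover far more signatures and edge cases):

    # Deliberately incomplete header sniffing: a handful of well-known
    # magic numbers, for illustration only.
    MAGIC = {
        b"\x50\x4b\x03\x04": "zip/docx/xlsx (archive)",
        b"\x89PNG\r\n\x1a\n": "png (image)",
        b"%PDF-": "pdf (document)",
        b"\x7fELF": "elf (executable)",
    }

    def sniff(path: str) -> str:
        with open(path, "rb") as f:
            head = f.read(16)
        for magic, label in MAGIC.items():
            if head.startswith(magic):
                return label
        return "unknown"

    print(sniff("/usr/bin/env"))  # likely "elf (executable)" on Linux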
Co-author of Magika here (Elie). We didn't include the measurements in the blog post to avoid making it too long, but we did run them.
Overall, `file` takes about 6ms for a single file and 2.26ms per file when scanning multiple files. Magika is at 65ms for a single file and 5.3ms per file when scanning multiples.
So in the worst-case scenario Magika is about 10x slower, due to the time it takes to load the model, and about 2x slower on repeated detections. This is why we said it is not that much slower.
We will have more performance measurements in the upcoming research paper. Hope that answers the question.
Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?
Testing it on my own system, magika seems to use a lot more CPU-time:
file /usr/lib/* 0,34s user 0,54s system 43% cpu 2,010 total
./file-parallel.sh 0,85s user 1,91s system 580% cpu 0,477 total
bin/magika /usr/lib/* 92,73s user 1,11s system 393% cpu 23,869 total
Looks about 50x slower to me. There's 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.
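If anyone wants to reproduce this on their own machine, a rough harness that just shells out to both CLIs in batch mode looks like this (wall-clock only, no CPU pinning, so results will vary with core count and model load time):

    # Rough benchmark: time `file` vs `magika` on the same batch of files
    # via their CLIs. Wall-clock only; no pinning, so numbers will vary.
    import glob, os, subprocess, time

    files = [f for f in glob.glob("/usr/lib/*") if os.path.isfile(f)][:1000]

    for cmd in (["file"], ["magika"]):
        start = time.perf_counter()
        subprocess.run(cmd + files, stdout=subprocess.DEVNULL, check=False)
        print(cmd[0], f"{time.perf_counter() - start:.2f}s")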
Electricity is cheap. If this is actually important for your org, you should measure it yourself. There are too many variables and factors that depend on your org's hardware.
Totally disagree. Most end users are on laptops and mobile devices these days, not desktop towers. Thus power efficiency is important for battery life. Performance per watt would be an interesting comparison.
You might be surprised. Rename your Photo.JPG as Photo.PNG and you'll still get a perfectly fine thumbnail. The extension is a hint, but it isn't definitive, especially when you start downloading from the web.
Of course, it's arguably unlikely a virus scanner would opt for an ML-based approach, as they specifically need to be robust against adversarial inputs.
I mean if you care about that you shouldn't be running anything that isn't highly optimized. Don't open webpages that might be CPU or GPU intensive. Don't run Electron apps, or really anything that isn't built in a compiled language.
Certainly you should do an audit of all the Android and iOS apps as well, to make sure they've been made in an efficient manner.
Block ads as well, they waste power.
This file identification is SUCH a small aspect of everything that is burning power in your laptop or phone as to be laughable.
Whilst energy usage is indeed a small aspect this early on when using bespoke models, we do have to consider that this is a model for simply identifying a file type.
What happens when we introduce more bespoke models for manipulating the data in that file?
This feels like it could slowly boil to the point of programs using orders of magnitude more power, at which point it'll be hard to claw it back.
That's a slippery slope argument, which is a common logical fallacy[0]. This model being inefficient compared to the best possible implementation does not mean that future additions will also be inefficient.
It's the equivalent of saying that many people programming in Ruby causes all future programs to be less efficient. Which is not true. In fact, many people programming in Ruby has caused Ruby (or Python, for that matter) to become more efficient, because it gets optimised as it gets used more.
It's not as energy efficient as C, but it hasn't caused it to get worse and worse, and spiral out of control.
Likewise smart contracts are incredibly inefficient mechanisms of computation. The result is mostly that people don't use them for any meaningful amounts of computation, that all gets done "Off Chain".
Generative AI is definitely less efficient, but it's likely to improve over time; indeed, things like quantization have allowed models that would normally require much more substantial hardware resources (and therefore be more energy intensive) to run on smaller systems.
The slippery slope fallacy is: "this is a slope. you will slip down it." and is always fallacious. Always. The valid form of such an argument is: "this is a slope, and it is a slippery one, therefore, you will slip down it."
In general you're right, but I can't think of a single local use for identifying file types by a human on a laptop - at least, one with scale where this matters. It's all going to be SaaS services where people upload stuff.
We are building a data analysis tool with great UX, where users select data files, which are then parsed and uploaded to S3 directly, on their client machines. The server only takes over after this step.
Since the data files can be large, this approach avoids transferring the file twice, first to the server and then to S3 after parsing.
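The usual mechanism for this kind of direct-to-S3 upload is a short-lived presigned URL generated server-side; a minimal boto3 sketch (bucket name, key, and expiry are placeholders, not our actual setup):

    # Minimal sketch of the server-side half of a direct-to-S3 upload:
    # generate a short-lived presigned PUT URL that the client uploads to,
    # so the file never has to pass through the application server.
    # Bucket, key, and expiry are placeholders.
    import boto3

    s3 = boto3.client("s3")

    def presigned_upload_url(key: str, expires_in: int = 900) -> str:
        return s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": "example-uploads-bucket", "Key": key},
            ExpiresIn=expires_in,
        )

    print(presigned_upload_url("datasets/user123/data.csv"))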
Indeed, but as pointed out in the blog post, `file` is significantly less accurate than Magika. There are also some file types that we support and `file` doesn't, as reported in the table.
I can't immediately find the dataset used for benchmarking. Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?
> Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?
That's not the point of file type guessing, is it? Google employs it as an additional security measure for user-submitted content, which absolutely makes sense given what malware devs do with file types.
We did release the npm package because we created a web demo and thought people might want to use it as well. We know it is not as fast as the Python version or a C++ version -- which is why we marked it as experimental.
The release includes the Python package and the CLI, which are quite fast and are the main way we expected people to use Magika -- sorry if that wasn't clear in the post.
The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.
Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.
I did try the python cli, but it seems to be about 30x slower than `file` for the random bag of files I checked.
I'll probably take some time this weekend to make a couple of issues around misidentified files.
Here are the slides of my recent talk at FIC on how Google uses AI to strengthen Gmail's document defenses and withstand attacks that evade traditional antivirus solutions.
This talk recounts how, over the last few years, we researched and developed a specialized office document scanner that combines a custom document analyzer with deep learning to detect malicious docx and xls files that bypass standard AVs. In 2021, our AI scanner detected on average 36% additional malicious documents that eluded other scanners, and 178% more at peak performance.
I hope you will find those slides useful and informative. If you have any questions, please ask away -- I will do my best to answer :)