Hacker News | ebursztein's comments

Thanks for looking in depth at our post. The Hitachi RTU500 mention is not a hallucination; we did check for those. It is mentioned in the Mandiant threat intelligence data.


Have you considered that Mandiant is wrong? I cannot find any evidence that it would be vulnerable. Hitachi doesn't even appear to be a technology partner of Palo Alto (https://technologypartners.paloaltonetworks.com/English/dire...).

As far as I can tell, the only connection between those is that CISA released this alert, which mentions multiple unrelated advisories in one post. Those happen to be the Siemens Palo Alto advisory and another unrelated Hitachi advisory for the RTU500: https://www.cisa.gov/news-events/alerts/2024/04/25/cisa-rele...


Isn't the tool doing its job in that case? I wouldn't generally expect it to independently determine that an otherwise reliable source made a mistake. In fact I feel like that would be a really bad idea.

Imagine if a relatively clueless intern left something out of a report because the textbook "seemed wrong".


I don't really know what its job is to be honest.

Saying that the input data is wrong and the AI didn't hallucinate that data is also kind of a "trust me bro" statement. The Mandiant feed is not public, so I cannot check what was fed to it.

I don't really care why it's wrong. It is wrong. And using that as the example prompt in your announcement is an interesting choice.


Try Usearch - it's really fast and underrated: https://github.com/unum-cloud/usearch


Thanks for the feedback -- we will look into it. If you can share the list of URLs, that would be very helpful so we can reproduce -- send us an email at magika-dev@google.com if that is possible.

For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use-cases would emerge, so it is good to know that such a model might be useful.

We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best efforts to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.

Thanks again for sharing your experience with Magika this is very useful.


Sure thing :)

Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff

These are files that were in one of my crawl datasets.

[0]: https://poc.lol/files/magika-test.tgz


Thank you - we are adding them to our test suite for the next version.


Super, thank you! I look forward to it :)

I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is in automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
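
To make that concrete, here's a tiny sketch (my own illustration, nothing from Magika) showing that ordinary words pass strict base64 validation:

    import base64

    # Any string whose length is a multiple of 4 and whose characters are all in
    # the base64 alphabet decodes without error, even if it was never encoded data,
    # so "does it decode" alone can't separate real encodings from plain text.
    for word in ["word", "deadbeef", "collectively"]:
        print(word, "->", base64.b64decode(word, validate=True))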


Do you have permission to redistribute these files?


LOL nice b8 m8. For the rest of you who are curious, the files look like this:

    <HTML><HEAD>
    <TITLE>Access Denied</TITLE>
    </HEAD><BODY>
    <H1>Access Denied</H1>
     
    You don't have permission to access "http&#58;&#47;&#47;placement&#46;api&#46;test4&#46;example&#46;com&#47;" on this server.<P>
    Reference&#32;&#35;18&#46;9cb0f748&#46;1695037739&#46;283e2e00
    </BODY>
    </HTML>
Legend. "Do you have permission" hahaha.


You are asking what if this guy has "web crawl data" that Google does not have?

And what if he says no, he does not have permission.


> You are asking what if this guy has "web crawl data" that google does not have?

No, I'm asking if he has permission to redistribute these files.


Are you attempting to assert that use of these files solely for the purpose of improving a software system meant to classify file types does not fall under fair use?

https://en.wikipedia.org/wiki/Fair_use


I'm asking a question.

Here's another one for you: Do you believe that all pictures you have ever taken, all emails you have ever written, all code you have ever written could be posted here on this forum to improve someone else's software system?

If so, could you go ahead and post that zip? I'd like to ingest it in my model.


Your question seems orthogonal to the situation. The three files posted seem to be the minimum amount of information required to reproduce the bug. Fair use encompasses a LOT of uses of otherwise copyrighted work, and this seems clearly to be one.


I don't see how publicly posting them on a forum is

> the minimum amount of information required to reproduce the bug

MAYBE if they had communicated privately that'd be an argument that made sense.


So you don't think that software development which happens in public web forums deserves fair use protection?


That's an interesting way to frame "publicly posted someone else's data without their consent for anyone to see and download"


I notice you're so invested that you haven't noticed that the files have been renamed and zipped such that they're not even indexable. How you'd expect anyone not participating in software development to find them is yet to be explained.


[flagged]


Have fun, buddy!


It's three files that were scraped from (and so publicly available on) the web. That's not at all similar to your strawful analogy.


I'm over here trying to fathom the lack of control over one's own life it would take to cause someone to turn into an online copyright cop, when the data in question isn't even their own, is clearly divorced from any context which would make it useful for anything other than fixing the bug, and about which the original copyright holder hasn't complained.

Some people just want to argue.

If the copyright holder has a problem with the use, they are perfectly entitled to spend some of their dollar bills to file a lawsuit, as part of which the contents of the files can be entered into the public record for all to legally access, as was done with Scientology.

I don't expect anyone would be so daft.


Literally just asked a question and that seems to have set you off, bud. Are you alright? Do you need to feed your LLM more data to keep it happy?


I'm always happy to stand up for folks who make things over people who want to police them. Especially when nothing wrong has happened. Maybe take a walk and get some fresh air?


I share your distaste for people whose only contribution is subtraction, but I suggest you lay off the sarcasm. Trolls: don't feed. (Well done on your project BTW)


I don't see any sarcasm from me in the thread. I had serious questions. Perhaps you could point out what you see? Thanks for the supportive words about the project.


Perhaps I misread "Maybe take a walk and get some fresh air?" - no worries though.


I've certainly seen people say similar things facetiously, but I was being genuine. I'm not sure if beeboobaa was trolling or not, I try to take what folks say at face value. They seemed to be pretty attached to a particular point of view, though. Happens to all of us. The thing for attachment is time and space and new experiences. Walks are great for those things, and also the best for organizing thoughts. Einstein loved taking walks for these reasons, and me too. It feels better to suggest something helpful when discussion derails, than to hurl insults as happens all too frequently.


Literally all you did is bitch and moan about someone asking a simple question, lol. Go touch grass.


I already had my walk this morning, thanks! If you'd like to learn more about copyright law, including about all the ways it's fuzzy around the edges for legitimate uses like this one, I highly recommend groklaw.net. PJ did wonderful work writing about such boring topics in personable and readable ways. I hope you have a great day!


no thanks, not interested in your american nonsense laws. lecturing people who are asking SOMEONE ELSE a question is a terrible personality trait btw


181 out of 195 countries and counting!

https://en.wikipedia.org/wiki/Berne_Convention

Look at that map!

https://upload.wikimedia.org/wikipedia/commons/7/76/Berne_Co...

P.S. Berne doesn't sound like a very American name.

You would really learn a lot from reading Groklaw. Of course, I can't make you. Good luck in the world though!


man, you really are putting a lot of effort into justifying stealing other people's content


Thanks for such great opportunities to post educational content to Hacker News! I genuinely hope some things go your way, man. Rooting for you. Go get 'em.


If you can’t undermine someone’s argument, undermine their nationality. American tech culture doesn’t do this as much as it should, perhaps because we know eventually those folks wake up.


Not sure what your point is, but why would i care to learn about the laws of some other dude's country that he's using to support his bizarro arguments?


> why would i care to learn about the laws of some other dude's country

The website you're attempting to police other people's behavior on is hosted in the country you're complaining about. Lol.

Maybe there is a website local to your country where your ideas would be better received?


You're so brave


Thanks!


What is the MIME type of a .tar file; and what are the MIME types of the constituent concatenated files within an archive format like e.g. tar?
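
To illustrate the second half of that question: the tar container itself is typically reported as application/x-tar, while each member's type has to be sniffed from its own leading bytes. A small sketch of the idea (mine, with a hypothetical archive name and only a handful of signatures):

    import tarfile

    # A few well-known magic-number prefixes (see the List of file signatures link below).
    SIGNATURES = {
        b"\x89PNG\r\n\x1a\n": "image/png",
        b"\xff\xd8\xff": "image/jpeg",
        b"%PDF-": "application/pdf",
        b"PK\x03\x04": "application/zip",
        b"\x7fELF": "application/x-executable",
    }

    def sniff(data: bytes) -> str:
        for magic, mime in SIGNATURES.items():
            if data.startswith(magic):
                return mime
        return "application/octet-stream"  # fallback for unknown content

    # Sniff each regular-file member of a (hypothetical) archive.
    with tarfile.open("example.tar") as tar:
        for member in tar.getmembers():
            if member.isfile():
                print(member.name, sniff(tar.extractfile(member).read(16)))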

hachoir/subfile/main.py: https://github.com/vstinner/hachoir/blob/main/hachoir/subfil...

File signature: https://en.wikipedia.org/wiki/File_signature

PhotoRec: https://en.wikipedia.org/wiki/PhotoRec

"File Format Gallery for Kaitai Struct"; 185+ binary file format specifications: https://formats.kaitai.io/

Cross-reference table: https://formats.kaitai.io/xref.html

AntiVirus software > Identification methods > Signature-based detection, Heuristics, and ML/AI data mining: https://en.wikipedia.org/wiki/Antivirus_software#Identificat...

Executable compression; packer/loader: https://en.wikipedia.org/wiki/Executable_compression

Shellcode database > MSF: https://en.wikipedia.org/wiki/Shellcode_database

sigtool.c: https://github.com/Cisco-Talos/clamav/blob/main/sigtool/sigt...

clamav sigtool: https://www.google.com/search?q=clamav+sigtool

https://blog.didierstevens.com/2017/07/14/clamav-sigtool-dec... :

  sigtool --find-sigs "$name" | sigtool --decode-sigs
List of file signatures: https://en.wikipedia.org/wiki/List_of_file_signatures

And then also: clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools; `debsums`/`rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they are supposed to have been installed from; a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan with what was presumed good; ~gdesktop LLM tools scan every file; and there are extended filesystem attributes for label-based MAC systems like SELinux, oh and NTFS ADS.
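
To make the file-based HIDS idea concrete, a minimal sketch (my own illustration, not any particular product): record a baseline of hashes, then diff a later scan against it.

    import hashlib, json, pathlib, sys

    # Map every regular file under `root` to its SHA-256 digest.
    def scan(root):
        return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in pathlib.Path(root).rglob("*") if p.is_file()}

    # Usage: python hids.py baseline /some/dir   ...then later:   python hids.py check /some/dir
    mode, root = sys.argv[1], sys.argv[2]
    if mode == "baseline":
        json.dump(scan(root), open("baseline.json", "w"))
    else:
        baseline = json.load(open("baseline.json"))
        for path, digest in scan(root).items():
            if baseline.get(path) != digest:
                print("changed or new:", path)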

A sufficient cryptographic hash function yields random bits with uniform probability. DRBG Deterministic Random Bit Generators need high entropy random bits in order to continuously re-seed the RNG random number generator. Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?

https://github.com/google/osv.dev/blob/master/README.md#usin... :

> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API. ...

That works with package metadata, not with a (file hash, package) database, which could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.

Might as well be heating a pool on the roof with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.

Add'l useful formats:

> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDB SBOMs, and git repositories

Things like bittorrent magnet URIs, Named Data Networking, and IPFS are (file-hash based) "Content addressable storage": https://en.wikipedia.org/wiki/Content-addressable_storage


I’m not sure what this comment is trying to say


File-based hashing is done in so many places; there's so much heat.

Sub-file hashing with feature engineering is necessary for AV, which must take packing, obfuscation, loading, and dynamic analysis into account in addition to zip archives and file magic numbers.

AV (antivirus) applications with LLMs: what do you train them on, and what are some of the existing signature databases?

https://SigStore.dev/ (The Linux Foundation) also has a hash-file inverted index for released artifacts.

Also, OTOH, with a time limit:

1. What file is this? Dirname, basename, hash(es)

2. Is it supposed to be installed at that path?

3. Per its header, is the file an archive, an image, or a document?

4. What file(s), records, and fields are packed into a file, and which transforms were applied to the data?


Co-author of Magika here (Elie). We didn't include these measurements in the blog post to avoid making it too long, but we did take them.

Overall, file takes about 6ms for a single file and 2.26ms per file when scanning multiple files. Magika is at 65ms for a single file and 5.3ms per file when scanning multiples.

So in the worst-case scenario Magika is about 10x slower, due to the time it takes to load the model, and about 2x slower on repeated detections. This is why we said it is not that much slower.

We will have more performance measurements in the upcoming research paper. Hope that answers the question.


Is that single-threaded libmagic vs Magika using every core on the system? What are the numbers like if you run multiple libmagic instances in parallel for multiple files, or limit both libmagic and magika to a single core?

Testing it on my own system, magika seems to use a lot more CPU-time:

    file /usr/lib/*  0,34s user 0,54s system 43% cpu 2,010 total
    ./file-parallel.sh  0,85s user 1,91s system 580% cpu 0,477 total
    bin/magika /usr/lib/*  92,73s user 1,11s system 393% cpu 23,869 total
Looks about 50x slower to me. There are 5k files in my lib folder. It's definitely still impressively fast given how the identification is done, but the difference is far from negligible.
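
If anyone wants to reproduce a rough comparison themselves, here's a small sketch I'd use; it measures wall-clock time plus the accumulated CPU time of the child process, and assumes both `file` and `magika` are on PATH:

    import resource, subprocess, sys, time

    # Run each tool once over the same set of paths and report wall vs. CPU time.
    paths = sys.argv[1:]
    for cmd in (["file", *paths], ["magika", *paths]):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL)
        wall = time.perf_counter() - start
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
        print(f"{cmd[0]}: wall={wall:.2f}s cpu={cpu:.2f}s")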


Do you have a sense of performance in terms of energy use? 2x slower is fine, but is that at the same wattage, or more?


That sounds like a nit / premature optimization.

Electricity is cheap. If this is sufficiently or actually important for your org, you should measure it yourself. There are too many variables and factors subject to your org’s hardware.


The hardware requirements of a massively parallel algorithm can't possibly be "a nit" in any universe inhabited by rational beings.


Totally disagree. Most end users are on laptops and mobile devices these days, not desktop towers. Thus power efficiency is important for battery life. Performance per watt would be an interesting comparison.


What end users are working with arbitrary files that they don’t know the identification of?

This entire use case seems to be one suited for servers handling user media.


File managers that render preview images. Even detecting which software to open the file with when you click it.

Of course, on Windows the convention is to use the file extension, but on other platforms the convention is to look at the file contents.


> on other platforms the convention is to look at the file contents

MacOS (that is, Finder) also looks at the extension. That has also been the case with any file manager I've used on Linux distros that I can recall.


You might be surprised. Rename your Photo.JPG as Photo.PNG and you'll still get a perfectly fine thumbnail. The extension is a hint, but it isn't definitive, especially when you start downloading from the web.


Browsers often need to guess a file type


Theoretically? Anyone running a virus scanner.

Of course, it's arguably unlikely a virus scanner would opt for an ML-based approach, as they specifically need to be robust against adversarial inputs.


> it's arguably unlikely a virus scanner would opt for an ML-based approach

Several major players such as Norton, McAfee, and Symantec all at least claim to use AI/ML in their antivirus products.


You'd be surprised what an AV scanner would do.

https://twitter.com/taviso/status/732365178872856577


I mean if you care about that you shouldn't be running anything that isn't highly optimized. Don't open webpages that might be CPU or GPU intensive. Don't run Electron apps, or really anything that isn't built in a compiled language.

Certainly you should do an audit of all the Android and iOS apps as well, to make sure they've been made in an efficient manner.

Block ads as well, they waste power.

This file identification is SUCH a small aspect of everything that is burning power in your laptop or phone as to be laughable.


Whilst energy usage is indeed a small aspect this early on when using bespoke models, we do have to consider that this is a model for simply identifying a file type.

What happens when we introduce more bespoke models for manipulating the data in that file?

This feels like it could slowly boil to the point of programs using orders of magnitude more power, at which point it'll be hard to claw it back.


That's a slippery slope argument, which is a common logical fallacy[0]. This model being inefficient compared to the best possible implementation does not mean that future additions will also be inefficient.

It's equivalent to saying that many people programming in Ruby is causing all future programs to be less efficient, which is not true. In fact, many people programming in Ruby has caused Ruby to become more efficient, because it gets optimised as it gets used more (or Python, for that matter).

It's not as energy efficient as C, but it hasn't caused it to get worse and worse, and spiral out of control.

Likewise smart contracts are incredibly inefficient mechanisms of computation. The result is mostly that people don't use them for any meaningful amounts of computation, that all gets done "Off Chain".

Generative AI is definitely less efficient, but it's likely to improve over time, and indeed things like quantization have allowed models that would normally require much more substantial hardware resources (and therefore be more energy intensive) to be run on smaller systems.

[0]: https://en.wikipedia.org/wiki/Slippery_slope


That is a fallacy fallacy. Just because some slopes are not slippery that does not mean none of them are.


The slippery slope fallacy is: "this is a slope. you will slip down it." and is always fallacious. Always. The valid form of such an argument is: "this is a slope, and it is a slippery one, therefore, you will slip down it."


No, it isn't.


Yeah. Yeah, it is.


>This feels like it could slowly boil to the point of programs using magnitudes higher power, at which point it'll be hard to claw it back.

We're already there. Modern software is, by and large, profoundly inefficient.


In general you're right, but I can't think of a single local use for identifying file types by a human on a laptop - at least, one with scale where this matters. It's all going to be SaaS services where people upload stuff.


We are building a data analysis tool with great UX, where users select data files, which are then parsed and uploaded to S3 directly, on their client machines. The server only takes over after this step.

Since the data files can be large, this approach avoids having to transfer the file twice, first to the server and then to S3 after parsing.
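
A rough sketch of one way to do that kind of direct-to-S3 upload with presigned URLs (an illustration only, not our exact production code; the bucket and key names are made up):

    import boto3, requests

    # Server side: issue a short-lived presigned PUT URL for a specific object key.
    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "example-bucket", "Key": "uploads/data.csv"},
        ExpiresIn=3600,
    )

    # Client side: upload the selected file straight to S3, bypassing the app server.
    with open("data.csv", "rb") as f:
        requests.put(url, data=f).raise_for_status()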


This doesn't sound like a very common scenario.


Thanks for the list, we will probably try to extend the list of supported formats in a future revision.


Indeed, but as pointed out in the blog post -- file is significantly less accurate than Magika. There are also some file types that we support and file doesn't, as reported in the table.


I can't immediately find the dataset used for benchmarking. Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?


> Is file actually failing on common files or just particularly nasty examples? If it's the latter then how does it compare to Magika on files that an average person is likely to see?

That's not the point of file type guessing, is it? Google employs it as an additional security measure for user-submitted content, which absolutely makes sense given what malware devs do with file types.


Thanks :)


We did release the npm package because we created a web demo and thought people might want to use it too. We know it is not as fast as the Python version or a C++ version would be -- which is why we marked it as experimental.

The release includes the Python package and the CLI, which are quite fast and are the main way we expect people to use Magika -- sorry if that wasn't clear in the post.
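
For those who prefer calling it as a library rather than via the CLI, a minimal sketch (see the README for the exact API, which may change between versions):

    from magika import Magika

    m = Magika()  # loads the model once; reuse this instance across many files
    res = m.identify_bytes(b"# Example\nSome markdown text")
    print(res.output.ct_label)  # e.g. "markdown"; field names may differ by version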

The goal of the release is to offer a tool that is far more accurate than other tools and works on the major file types, as we hope it will be useful to the community.

Glad to hear it worked on your files


Thank you for the release! I understand you're just getting it out the door. I just hope to see it delivered as a native library or something more reusable.

I did try the Python CLI, but it seems to be about 30x slower than `file` for the random bag of files I checked.

I'll probably take some time this weekend to make a couple of issues around misidentified files.

I'll definitely be adding this to my toolset!


For those interested, I put Crowley's original tarot deck on the Etteilla Foundation website: https://etteilla.org/en/deck/69/crowley-s-thoth-tarot


Here are the slides of my recent talk at FIC on how Google uses AI to strengthen Gmail's document defenses and withstand attacks that evade traditional antivirus solutions.

This talk recounts how, over the last few years, we researched and developed a specialized office document scanner that combines a custom document analyzer with deep learning to detect malicious docx and xls files that bypass standard AVs. In 2021 our AI scanner detected, on average, 36% additional malicious documents that eluded other scanners, and 178% at peak performance.

I hope you will find those slides useful and informative. If you have any questions, please ask away; I will do my best to answer :)

