Yes and no. It took me from 8am to 3am once we decided it needed to get fixed, but really the bug sat in the app for years. It only happened in a background process that sent print jobs on a timer; since that process used Windows GDI to compose the image we sent to the printer, it was affected. (Our "frontend" should've been affected too but never was, I guess because it had a different memory usage pattern.)
We just had it restart itself and try again whenever it got one of those errors when printing. Eventually, though, we wanted to add a feature that required the process to not die. By that time I was already 99% sure that it wasn't something in our code, and I had already ruled out threading issues.
I ended up putting it in a VM with a kernel debugger attached and having a script take a snapshot and make it print, over and over, until it errored, then following along in IDA until I saw what was going on.
Having a way to trigger it on demand (by restoring the snapshot) helped a lot; otherwise it would have taken forever to make sense of it, as it could sit without crashing for nearly an hour.
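For anyone wanting to try the same trick, here's a rough sketch of the loop in Python. It's only one reading of the setup above, not the actual script: it assumes a snapshot is taken before every print attempt, so the last snapshot is the one that fails when restored. `reproduce_on_demand`, `take_snapshot` and `run_print_job` are hypothetical names; the two callables stand in for whatever your hypervisor tooling and your app provide.

```python
from typing import Callable, Optional


def reproduce_on_demand(
    take_snapshot: Callable[[int], str],
    run_print_job: Callable[[], bool],
    max_attempts: int = 10_000,
) -> Optional[str]:
    """Snapshot, print, repeat until a print job fails.

    take_snapshot(i) is a hypothetical hook that snapshots the VM and
    returns the snapshot's name; run_print_job() is a hypothetical hook
    that returns True when the job printed fine and False on the error.

    Returns the name of the snapshot taken just before the first failing
    job, or None if nothing failed within max_attempts.
    """
    for attempt in range(max_attempts):
        snapshot_name = take_snapshot(attempt)
        if not run_print_job():
            # Restoring this snapshot should put the VM right before the failure.
            return snapshot_name
    return None
```

Once it hands back a snapshot name, you restore that snapshot with the kernel debugger attached and step through the failing call as many times as you need.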
In Hyper-V it's fairly easy. You make a virtual serial port ("COMPort"), set the bootloader to enable kernel debugging over serial, then connect to the virtual serial port from the host via a named pipe.
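To make that concrete, here's a small Python helper that just spells out the three commands involved, as far as I know them: `Set-VMComPort` on the Hyper-V host, `bcdedit` inside the guest, and `windbg -k com:pipe,...` back on the host. The function name, the VM name and the pipe name are made up for illustration; the helper only builds strings and doesn't run anything.

```python
def hyperv_serial_debug_commands(vm_name, pipe_name, com_port=1, baud=115200):
    """Build the commands for kernel debugging a Hyper-V guest over a virtual COM port.

    vm_name and pipe_name are placeholders; nothing here is executed.
    """
    # Named pipe on the host that backs the guest's virtual serial port,
    # i.e. \\.\pipe\<pipe_name>
    pipe = "\\\\.\\pipe\\" + pipe_name
    return {
        # Run in an elevated PowerShell on the Hyper-V host.
        "host_powershell": f'Set-VMComPort -VMName "{vm_name}" -Number {com_port} -Path "{pipe}"',
        # Run in an elevated prompt inside the guest, then reboot it.
        "guest_bcdedit": [
            "bcdedit /debug on",
            f"bcdedit /dbgsettings serial debugport:{com_port} baudrate:{baud}",
        ],
        # Run on the host to attach the kernel debugger to the pipe.
        "host_debugger": f"windbg -k com:pipe,port={pipe},resets=0,reconnect",
    }
```

The COM port number has to match between `Set-VMComPort -Number` and `debugport:`, which is why the helper takes it once and reuses it.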
Kernel debugging over serial should be possible in vSphere, but Ethernet is easier to set up:
1. Make sure at least one virtual NIC on the target VM (with IP connectivity to the debug host machine/VM) is on Microsoft's NIC whitelist[1]. I use e1000e; note that vmxnet3 is not on the list.
2. Follow Microsoft's directions[2] to connect.
I can confirm this works on vSphere[3] and there's no reason it shouldn't also work on VMware Workstation, Player, and Fusion.
[3] Tested last week with a Windows Server 2022 target (e1000e virtual NIC) and Windows 10 debug host (vmxnet3 virtual NIC), both running on ESXi 8.0 Update 1 VM hosts.
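For reference, the Ethernet setup boils down to a couple of commands on each side. The Python sketch below only formats them, assuming the manual `bcdedit` route; Microsoft's directions[2] are the authority, so treat this as a memory aid. `net_debug_commands` is a hypothetical name, and the IP, port and key you pass in are your own values (as I understand it, `bcdedit` can also generate the key for you if you leave `key:` off).

```python
import ipaddress


def net_debug_commands(host_ip, key, port=50000):
    """Build the commands for Windows kernel debugging over Ethernet.

    host_ip is the IPv4 address of the debug host, key is the shared
    connection key, port is the UDP port the debugger listens on.
    Nothing here is executed.
    """
    # Raises ValueError if host_ip is not a valid IPv4 address.
    ipaddress.IPv4Address(host_ip)
    # Assumption: the port must come from the dynamic range 49152-65535.
    if not 49152 <= port <= 65535:
        raise ValueError("port should be in the range 49152-65535")
    return {
        # Run in an elevated prompt on the target VM, then reboot it.
        "target": [
            "bcdedit /debug on",
            f"bcdedit /dbgsettings net hostip:{host_ip} port:{port} key:{key}",
        ],
        # Run on the debug host before (or while) the target boots.
        "host": f"windbg -k net:port={port},key={key}",
    }
```

The port and key have to match exactly on both ends, and the debug host's firewall has to let that UDP port through.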
Hyper-V didn't arrive until Windows Server 2008. I know you could do virtual serial ports with VMware GSX and ESX (and later ESXi), forwarded to real hardware serial ports on the host.