Actually, it is easy to come up with reasonably decent heuristics that can auto-tag a corpus. From the auto-tagged output you can look for anomalies and adjust your tagging system.
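To make the idea concrete, here is a minimal sketch of that loop in Python. Everything in it is an assumption for illustration: the suffix rules, the tag names, and the `auto_tag` / `find_anomalies` helpers are hypothetical, not the heuristics I actually used. It tags tokens with a few ordered regex rules, then reports tokens that repeatedly fall through to the default tag, which is one simple way to spot gaps in the rules.

```python
import re
from collections import Counter

# Hypothetical rules, checked in order; the first match wins.
RULES = [
    (re.compile(r"^\d+(\.\d+)?$"), "NUM"),
    (re.compile(r"^(the|a|an)$", re.IGNORECASE), "DET"),
    (re.compile(r".+ly$"), "ADV"),
    (re.compile(r".+(ing|ed)$"), "VERB"),
    (re.compile(r".+(tion|ness|ment)$"), "NOUN"),
]
DEFAULT_TAG = "UNK"  # assigned when no rule matches


def auto_tag(tokens):
    """Tag each token with the first matching rule, else DEFAULT_TAG."""
    tagged = []
    for tok in tokens:
        for pattern, tag in RULES:
            if pattern.match(tok):
                tagged.append((tok, tag))
                break
        else:
            tagged.append((tok, DEFAULT_TAG))
    return tagged


def find_anomalies(tagged, min_count=2):
    """Return (token, count) pairs for tokens left untagged at least min_count times."""
    counts = Counter(tok.lower() for tok, tag in tagged if tag == DEFAULT_TAG)
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]


tokens = "The cat quickly jumped the fence and the cat ran 3 times".split()
tagged = auto_tag(tokens)
anomalies = find_anomalies(tagged)
```

Frequent untagged tokens are the ones worth a new rule or a lexicon entry; you then re-run the tagger and check the anomaly list again.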
Getting a representative corpus is (surprisingly) much harder than the annotation. I know: I spent quite some time doing this years ago.