I then wrote a script that used HarfBuzz to extract every glyph for 75,000 characters as PNG files. That took about 4 months to run on a spare computer, writing 500 GB to an external disk.
I now want to filter out the blank glyphs, but that's really slow even over USB 3. So I bought a 2 TB upgrade for my MacBook Pro, handed the 512 GB drive down to my MacBook Air, and now I'm copying the files from the external disk to the internal SSD. There are about 90 million files to copy, with an estimated 4 days remaining. Once the blank images are removed, I plan to use the data to train my own Chinese OCR with TensorFlow.
The TTF files alone are 11.82 GB. If you can recommend a suitable file host, I could re-upload them for you.
Honestly, most of what I've done since then has been data collection (song lyrics, movie subtitles, etc.) rather than developing new features.
My favourite feature now: in church, I read the song lyrics, find 4 characters I know, search my database of Christian song lyrics, load the match into Pingtype, and sing along with the pinyin while understanding the meaning. It's all automated, but I can't upload it because I've received copyright threats about redistributing song lyrics. I'm not a limited-liability company (this is a side project), so I'd be personally liable for the consequences of putting it online.
I've done much more research to find new data sources. For example, 9gag helped me stumble upon a translated comic (Mixflavor & HowardInterprets). I transcribed all the comics, and I'm using them with my language-exchange tutor every week. I decided I wanted to find more comics that are popular with my friends.
So I extracted my Facebook friends' liked Pages. (Yes, that sounds like Cambridge Analytica, but I did it myself using an AppleScript to scrape and some bash scripts to parse). I found 223,783 pages, in 865 Facebook categories. I reduced the Facebook categories to 30 of my own categories (Art, Music, Cooking, Driving, Pets, Shopping, Religion, etc). Then I found the top pages for each of those. So I know the most popular musicians in Taiwan. That's going to become a blog post and Show HN soon, when the paranoia about Facebook calms down.
We're always looking for good submissions that didn't get much interest, so if anyone knows of others, please email links to hn@ycombinator.com and we'll take a look.
If I may make a suggestion, the part that's of interest to me is what went into the making. I also wouldn't have taken special note of the old submission, because it doesn't tell me anything about how it works or how it came to be, and I have no actual need to translate to or from Chinese. It seems you've been very thoughtful and done a lot of work, which I think would be of interest to people, should you feel the inclination to describe it. All the best :)
Sounds fascinating! I guess you might not have gotten upvotes because, between the title and the front page of the link, as an English speaker I didn't know what I was looking at or why it was interesting.
The scripts are pretty messy, but the process went like this. It was necessary to use AppleScript because the Graph API doesn't give access to friends' Likes (because of privacy issues, e.g. Cambridge Analytica), but AppleScript has access to everything through the GUI. (If anyone from Facebook reads this, please don't ban me. I'm just doing this to find out what my friends here like, so I can learn Chinese. I'm not selling this data!)
1. Get a list of IDs (I must be friends with them).
I manually maintain Lists of friends I met in each country. I went to my Taiwan list, scrolled down, and copied the page source into TextWrangler. A few regex find-and-replaces later, I had a list of all my Taiwanese friends' IDs.
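The regex step could be scripted too. Here's a rough shell sketch — the sample markup and the href pattern are illustrative guesses, not the exact regex I used; Facebook's markup changes often, so adapt it to whatever the copied source actually contains:

```shell
# Illustrative sample of what the copied friend-list source might contain
cat > taiwan_list.html <<'EOF'
<a href="https://www.facebook.com/john.doe?fref=pb">John Doe</a>
<a href="https://www.facebook.com/jane.roe?fref=pb">Jane Roe</a>
EOF

# Pull out the profile IDs: match the profile links, strip everything
# but the ID, and deduplicate
grep -oE 'facebook\.com/[A-Za-z0-9.]+\?fref' taiwan_list.html \
  | sed -E 's#facebook\.com/([A-Za-z0-9.]+)\?fref#\1#' \
  | sort -u > taiwan_ids.txt
```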
3. Scroll down and copy out. This is GUI-intensive, so run it on a spare computer.
The loop looks roughly like this (the delay is approximate; tune it to your connection):

repeat
	-- press End so Facebook loads the next batch of Likes
	tell application "System Events" to tell process "Safari" to key code 119
	delay 2
	-- grab the rendered page source
	tell application "Safari" to tell front document to set download_source to do JavaScript "document.documentElement.innerHTML;"
	-- the _51lb div marks the end of the list; the error text means the page is gone
	if download_source contains "<div class=\"_51lb\">" then exit repeat
	if download_source contains "The page you requested cannot be displayed" then exit repeat
end repeat
4. Write it to a file (use cat, not Apple's recommended code, in order to preserve Unicode).
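The reason for cat: it copies bytes verbatim, whereas (as far as I can tell) AppleScript's built-in write command re-encodes text in the primary encoding unless you explicitly ask for «class utf8», which mangles Chinese. A tiny sanity check of the byte-fidelity point (filenames are just for the demo):

```shell
# cat preserves multi-byte UTF-8 characters exactly
printf '測試頁面原始碼\n' > source.html
cat source.html > copy.html
cmp -s source.html copy.html && echo "bytes identical"
```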
5. Convert to text.
In my case, the HTML files took up 479 MB for 1576 friends. I wrote another script to convert them to text.
Split the HTML based on the "<a class=\"darkTouch _51b5\" href=\"" delimiter.
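A sketch of that split in shell — the sample input is made up to match the delimiter from the post; real saved pages are much messier:

```shell
# Illustrative stand-in for a saved profile page
cat > friend_sample.html <<'EOF'
<div><a class="darkTouch _51b5" href="/MayDayTaiwan/">Page one</a></div>
<div><a class="darkTouch _51b5" href="/someband/">Page two</a></div>
EOF

# Keep what follows the delimiter up to the closing quote,
# i.e. each liked Page's relative URL, one per line
grep -o '<a class="darkTouch _51b5" href="[^"]*"' friend_sample.html \
  | sed 's/.*href="//; s/"$//' > friend_sample.txt
```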
6. Post-processing!
Now it's time to do research. What are the most common likes? Just combine all the files using cat, and use a bash script to find the most common lines:
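The bash script is basically the classic frequency-count one-liner (the sample data here stands in for the combined likes files):

```shell
# Stand-in for `cat *.txt` over all friends' likes files
cat > all_likes.txt <<'EOF'
/MayDayTaiwan/
/someband/
/MayDayTaiwan/
/MayDayTaiwan/
/someband/
EOF

# Count identical lines, then list the most common first
sort all_likes.txt | uniq -c | sort -rn | head -20
```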
I plan to write more about this soon, and I'll probably put it on the Pingtype blog. But I might put it on Medium, because people seem to like that these days. Maybe both. There's also my personal website, but I'm worried that people might complain about privacy, so maybe I should distance myself from it. I'm not afraid to write the comment here, because we're all hackers.