tfederman's comments

It's not a big data set that lends itself primarily to analysis; it's more like content. For example, a list of all US Presidents with a lot of metadata or text content fields about them collected/combined from different sources, cleaned, corrected, annotated, etc. (Pretend Wikipedia has only a subset of these fields and considers broadening them out of scope.)

As for GitHub, the data would still be under "my" account, and I'm thinking more of a platform that doesn't depend on one person. Maybe I would manage day-to-day version control in GitHub, but I'd want to promote occasional releases to be more official and not reliant on my account.


In the GPT-2 era I created CouldReads, a big data set of generated book titles/synopses trained on thousands of e-books. It was a fun project in the naivete of 2020 but it's less amusing now.


A while back I wrote up a way to turn the big Wikipedia XML dump into a database. Not a generic table with articles but thousands of tables, one for each article "type". I'm not sure if this is still the best way to go about it.

https://feder001.com/exploring-wikipedia-as-a-database-part-...
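To make the "one table per article type" idea concrete, here is a minimal sketch, not the method from the write-up. It assumes that an article's "type" is the name of its first infobox template and that the dump follows the usual MediaWiki export layout (page, title, text elements). The helper names and the sample XML are made up for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the article "type" is the name of its first {{Infobox ...}} template.
INFOBOX_RE = re.compile(r"\{\{\s*Infobox\s+([^|}\n<]+)", re.IGNORECASE)


def local_name(tag):
    """Drop the XML namespace so the code works across export schema versions."""
    return tag.rsplit("}", 1)[-1]


def table_name(infobox_type):
    """Turn an infobox name into a hypothetical table name, e.g. 'infobox_mountain'."""
    slug = re.sub(r"[^a-z0-9]+", "_", infobox_type.strip().lower()).strip("_")
    return "infobox_" + slug


def group_pages_by_type(xml_stream):
    """Stream the dump and map each table name to the titles that belong in it."""
    groups = defaultdict(list)
    title, text = None, None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            match = INFOBOX_RE.search(text or "")
            if match and title:
                groups[table_name(match.group(1))].append(title)
            title, text = None, None
            elem.clear()  # release the finished page so memory stays flat
    return dict(groups)


# Made-up sample standing in for the real multi-gigabyte dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Abraham Lincoln</title><revision><text>{{Infobox officeholder
| name = Abraham Lincoln
}}</text></revision></page>
  <page><title>Mount Everest</title><revision><text>{{Infobox mountain | name = Everest}}</text></revision></page>
  <page><title>Stub</title><revision><text>No infobox here.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    for table, titles in group_pages_by_type(io.BytesIO(SAMPLE)).items():
        print(table, titles)
```

Streaming with iterparse and clearing each page after use avoids loading the whole dump into memory. Parsing the individual infobox fields into columns would be a separate step.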


That looks like a crowdsourced project for turning arbitrary sites into RSS which is very cool, but I don't see a way to get a large RSS data set out of it. And with about 5000 sources (I think) it's not as large as what I was hoping for, but it could be a good complementary source.


RSS reader through Bluesky custom feeds: https://github.com/tfederman/stroma-news

Bluesky API library spun off from the other project: https://github.com/tfederman/pysky

Haven't really started it yet, but a master list of RSS feeds and the code I used to source them: https://github.com/tfederman/huge-rss-list

And also a new project to fetch all links seen in the Bluesky firehose and gather metadata to build a database of sites and pages at a more granular level than the domain. For example, is account X posting video links from one YT channel or many?
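As a rough illustration of "more granular than the domain", the sketch below groups each account's links by domain and then by a finer-grained source such as a channel. It assumes the per-page metadata has already been fetched into a dict keyed by URL. The `source` field, the function names, and the sample accounts and links are all hypothetical and not taken from the project.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the host name without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def sources_per_account(posts, page_metadata):
    """Map account -> domain -> sorted list of distinct sources seen.

    posts: iterable of (account, url) pairs.
    page_metadata: dict of url -> {"source": ...}; this shape is an assumption.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for account, url in posts:
        source = page_metadata.get(url, {}).get("source")
        if source is not None:
            seen[account][domain_of(url)].add(source)
    return {
        account: {domain: sorted(sources) for domain, sources in domains.items()}
        for account, domains in seen.items()
    }


# Made-up data for illustration only.
POSTS = [
    ("alice.example", "https://www.youtube.com/watch?v=aaa"),
    ("alice.example", "https://www.youtube.com/watch?v=bbb"),
    ("bob.example", "https://youtube.com/watch?v=ccc"),
    ("bob.example", "https://youtube.com/watch?v=ddd"),
]
METADATA = {
    "https://www.youtube.com/watch?v=aaa": {"source": "Channel A"},
    "https://www.youtube.com/watch?v=bbb": {"source": "Channel A"},
    "https://youtube.com/watch?v=ccc": {"source": "Channel A"},
    "https://youtube.com/watch?v=ddd": {"source": "Channel B"},
}

if __name__ == "__main__":
    print(sources_per_account(POSTS, METADATA))
```

In this sample one account links to a single channel and the other to two, which is the distinction a domain-level count would miss.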


Just for fun I wanted to do a simple server-side version of this where the submissions would be truly hidden on my account, so it would take effect on mobile too. And avoid client side artifacts like messed up numbering.

https://github.com/tfederman/hacker-news-topic-hider


I wish I had found this before I made my own client-side version! https://news.ycombinator.com/item?id=35352160


If anyone's interested in an approach to processing the data set quickly, I got something working and wrote it up when I was curious about turning the content into structured data for database tables.

https://feder001.com/exploring-wikipedia-as-a-database-part-...

https://feder001.com/exploring-wikipedia-as-a-database-part-...


My company just doesn't have product or project managers and it's wonderful. Fewer meetings, no intermediaries, more agility. It could only work in practice, it could never work in theory.


As someone who is responsible for 10-15 concurrent IT projects across numerous teams, many of whom I don't manage, this sounds terrible. I think the "no PMO" approach may work for homogeneous groups whose focus is one project, but for a large organization, effective project management is a must.


That's maybe true, and also a good reason I wouldn't work for a large company again.


I feel like I spend a troubling amount of time as a software developer dealing with and/or avoiding solutions that are much more complicated than the problems they're trying to solve.


tell me about it... I've been searching for a C++ library that will encode x86 instructions from syntax that looks moderately similar to assembler code. Not a disassembler, not a JIT toolkit, not a special cross-platform IR that gets compiled to real instructions. Just an instruction encoder with a moderately pretty syntax. No luck so far.


Have you seen xed [1]? It seems to fit the bill nicely from what I understand about your requirements.

[1] https://software.intel.com/sites/landingpage/pintool/docs/67...


Also, XED does not seem to be open source. It ships with a bunch of headers that say "Intel Open Source License" at the top, but there is a binary library instead of source files.


yes, I don't need its decoding capability and I don't think its syntax is pretty but if I ever decide to write my own library I'll probably wrap xed in some C++ sugar.


I'm not sure I understand your requirements but might DynASM be useful? It's one component of the JIT library behind LuaJIT but many people use it for run-time code generation completely outside a JIT setting.

http://luajit.org/dynasm.html


DynASM looks cool but its preprocessing step and fancy C integration definitely place it outside the description "Just an instruction encoder with a moderately pretty syntax."

I guess it's not fair to say that XED and DynASM are "much more complicated than the problems they're trying to solve." They are much more complicated than the problem I'm trying to solve. But I am surprised that there is no minimal x86 encoder with nice C++ syntax out there.


When I take notes because I need to retain something, the writing part is more valuable than reviewing it later. And writing on paper is important; typing isn't effective for that part.

When I take notes just to record details that I can look up later on demand it's faster to type and there's no downside. But that matches work more than it does school.

