a decade ago this month, i was sampling napster library content.
not what you are thinking. only the symbolic content [the library listings, not the audio].
this was a project for a research organization studying music trends among peer-to-peer file sharers. the original napster had an interesting feature: it let users view the entire library of anyone who shared content. the rest is easy: i wrote a napster client to sample shared libraries. this client ramped up into random sampling like this:
start with a set of seed search words
randomly pick from many responders [which changed rapidly]
get their entire library, save aside
use random picks from these libraries to seed a new search
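the loop above can be sketched roughly like this in python. this is not the original client: the `search` and `get_library` callables, the toy in-memory network, and the choice of splitting titles into search words are all assumptions made up for illustration, standing in for real napster protocol calls.

```python
import random

def snowball_sample(search, get_library, seed_words, max_libraries,
                    rng=None, max_rounds=10_000):
    """Sample libraries by alternating searches and library fetches.

    search(word) -> list of user ids (hypothetical stand-in for a napster search)
    get_library(user) -> list of titles (hypothetical stand-in for a browse request)
    """
    rng = rng or random.Random()
    libraries = {}              # user id -> library; keying by user drops duplicates
    words = list(seed_words)    # pool of search words, grows as libraries come in
    for _ in range(max_rounds):
        if len(libraries) >= max_libraries or not words:
            break
        responders = search(rng.choice(words))
        if not responders:
            continue
        user = rng.choice(responders)       # random pick among responders
        if user in libraries:
            continue                        # already sampled this library
        library = get_library(user)
        libraries[user] = library           # save the entire library aside
        # random picks from this library seed new searches
        for title in rng.sample(library, min(3, len(library))):
            words.extend(title.lower().split())
    return libraries

# a tiny made-up network, only so the sketch can run end to end
TOY_NETWORK = {
    "user_a": ["blue moon", "red river"],
    "user_b": ["red sky", "green day"],
    "user_c": ["green river", "blue sky"],
}

def toy_search(word):
    """Return users who share at least one title containing the word."""
    return [user for user, titles in TOY_NETWORK.items()
            if any(word in title.split() for title in titles)]

def toy_get_library(user):
    """Return a copy of the user's full library."""
    return list(TOY_NETWORK[user])

if __name__ == "__main__":
    sample = snowball_sample(toy_search, toy_get_library, ["blue"],
                             max_libraries=3, rng=random.Random(0))
    for user, titles in sorted(sample.items()):
        print(user, titles)
```

the `max_rounds` cap is only there so the sketch cannot loop forever when searches stop returning new users; a real client would more likely stop on a time limit.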
i sampled napster for four days, at its peak, and obtained 2647 unique libraries [after dropping a very small number of duplicates].
here is the entire library data, unmodified except that all usernames are blanked out. you are welcome to do whatever you want with it, so long as you let me know of your research and analysis results, and give due credit.
ps: if you would like to get this data with revision control: hg clone https://bitbucket.org/plan9/napdata