Data

data sources

Datasets:

For now, three datasets are in use. They were downloaded or scraped without using illegal techniques. They can be considered “found data”: no license was imposed, so I treat them as free to use. Even though they contain a lot of private information, the scale at which they circulate on the open internet makes them public data. I try to be conscientious in the outputs by not blatantly enabling abuse or abusing the data myself. I just look for beauty.

Facebook Profiles

In April 2021, data from 533 million Facebook profiles, complete with location, phone number, date of birth, relationship status, etc., surfaced publicly after Facebook (now Meta) had left it open to scraping. A file containing all the data was compiled by anonymous data collectors and made public. It was easy enough to obtain a copy once it leaked.

Stable Diffusion Prompts

In August 2022, Stability AI released a beta version of the Stable Diffusion weights to a group of beta testers. These testers could use a Discord server to generate images publicly. Over 20 million messages were scraped from the Discord servers and put into a database that links these intricate messages and makes them searchable.

reCAPTCHA Words

Words that aren’t words: they cannot be in a dictionary, so no computer knows about them. For that very reason, it’s rare to come across lists of these words. They are used to check whether you are a robot or not. A Turing test, so to say, but also a battle that cannot be won. As soon as these words are in a list, they are worthless in this battle. So what use are these meaningless words then? An invention that was so short-lived.

+------------------------------------------------+
|                               /->              |
|  +--+                        /    \            |
|  |  |   I'm not a robot           | recaptcha  |
|  +--+                        A    v            |
|                               \-               |
+------------------------------------------------+

facebook leak

What kind of company throws its most intimate customer data at everyone who wants it? Meta. The Facebook leak is a great source to start with when you want to do some data diving.

social