Topic: Best way to bulk download

Posted under General

What is the best way to mass download from this site? Since HB2112 got passed, I would like to download every post on this site, essentially making a local copy, since this bill will apparently affect all users and not just the ones in Arizona. I currently use gallery-dl to download, with the --write-metadata flag to get each post's json file as well.

The way I would like to download is by date (the tagged date, not the upload date). For example, downloading everything made before the year 2000 in one go, so all posts with a tagged date of 1999, 1998, 1997 ... all the way back to the earliest tagged post (which, if I remember correctly, is a Greek pot from approximately the 6th century BC). Is it possible to order posts this way?

Is there also a way to download the notes.json for each post? I asked in a different forum post and was told I could use something like https://e621.net/notes.json?search[post_id]=5244295 for a post, but this seems like it would only work for one post at a time.

While I'm downloading posts in order of date, I would not download pools the same way. I am going to do those separately so the pools stay in order. Is there a way to show pools in order of id? If not, I'll just make a text file to do that.

RE621 is a decent option for small jobs, not for bulk downloading. This is something you'd likely want a custom tool for, making it easy to get precisely what you want.

scth said:
RE621 is a decent option for small jobs, not for bulk downloading. This is something you'd likely want a custom tool for, making it easy to get precisely what you want.

I agree, re621 looks like it's for small jobs.

hiddenbird said:
https://github.com/McSib/e621_downloader

this is also a good one

Maybe, but I already use gallery-dl, and e621_downloader hasn't been updated in 2 years.

I was mainly asking if there was something I could put in the search bar (or something similar) to make it show the posts in tagged date order.

whowillevenreadthis said:
I was mainly asking if there was something I could put in the search bar (or something similar) to make it show the posts in tagged date order.

There is not. Best you could do is search one year at a time. Also, most posts don't have a tagged date.

scth said:
There is not. Best you could do is search one year at a time. Also, most posts don't have a tagged date.

Damn. OK, I guess I'll be doing that then.

Any idea how I would get the notes.json to download with the posts (if it exists)?

dba_afish said:
re621, see topic #25872

Re621 is fine for a few pages of posts, but for millions it is absolutely not enough

You should be using the posts export

These contain all the data you need to download each image; you can reconstruct the image url like so:
https://static1.e621.net/${md5.slice(0, 2)}/${md5.slice(2, 4)}/${md5}.${file_ext}

With this you need zero requests to the api for posts, and you can download from the static server about as fast as your connection can handle
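
A minimal sketch of that reconstruction, assuming the export row exposes md5 and file_ext fields like the template above (the daily exports live under /db_export/ on the site, if I remember right, and the actual CSV column names may differ). The base path is kept as a constant because image URLs on the live site usually include a /data/ segment, so check one reconstructed URL before starting a big batch:

// Rough sketch: rebuild a static URL from a posts-export row.
// Assumes the row exposes md5 and file_ext; the real CSV column names may differ.
interface ExportRow {
  md5: string;      // 32-character hex hash of the file
  file_ext: string; // e.g. "png", "jpg", "webm"
}

// Assumption: live-site image URLs usually include a "/data/" segment after
// the host, so verify one reconstructed URL before batching.
const STATIC_BASE = "https://static1.e621.net/data";

function staticUrl({ md5, file_ext }: ExportRow): string {
  return `${STATIC_BASE}/${md5.slice(0, 2)}/${md5.slice(2, 4)}/${md5}.${file_ext}`;
}

// staticUrl({ md5: "0123456789abcdef0123456789abcdef", file_ext: "png" })
// -> "https://static1.e621.net/data/01/23/0123456789abcdef0123456789abcdef.png"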

There's also a db export for the pools, so no downloading of pool json needed

For notes, forgo downloading per post; instead, request /notes.json with the page query parameter set to a followed by the highest id on the previous page, and the limit set to 320 (ex: /notes.json?page=a0&limit=320). This will ensure you do as few requests as possible

These same page & limit parameters work for nearly every search the site has
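
A rough sketch of that cursor loop (an assumption-laden example, not an official client): each note object is assumed to carry a numeric id, an empty result may come back as an object rather than a bare array, and the User-Agent below is a placeholder you'd replace with your own, along with your own rate limiting:

// Cursor pagination over /notes.json as described above.
interface Note { id: number; [key: string]: unknown; }

async function fetchAllNotes(): Promise<Note[]> {
  const all: Note[] = [];
  let cursor = 0; // page=a0 -> everything after id 0
  while (true) {
    const res = await fetch(
      `https://e621.net/notes.json?page=a${cursor}&limit=320`,
      { headers: { "User-Agent": "my-archiver/1.0 (by your_username)" } }, // placeholder UA
    );
    const body = await res.json();
    // An empty result may be an object instead of a bare array (assumption).
    const page: Note[] = Array.isArray(body) ? body : (body.notes ?? []);
    if (page.length === 0) break;                 // no more notes
    all.push(...page);
    cursor = Math.max(...page.map((n) => n.id));  // highest id on this page
  }
  return all;
}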

Note that db exports are only made once per day, so they will be slightly out of date at all times
For the remaining ~2000 posts (depending on the time) you can just use the api
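
For that catch-up step, the same page=a<id> cursor works against /posts.json; here's a hedged sketch, assuming the { posts: [...] } wrapper of the JSON API and that the starting id is the highest one found in your copy of the export:

// Pull everything newer than the daily export via the API.
async function catchUp(highestExportedId: number): Promise<unknown[]> {
  const newer: unknown[] = [];
  let cursor = highestExportedId;
  while (true) {
    const res = await fetch(
      `https://e621.net/posts.json?page=a${cursor}&limit=320`,
      { headers: { "User-Agent": "my-archiver/1.0 (by your_username)" } }, // placeholder UA
    );
    const { posts } = await res.json();  // /posts.json wraps results in { posts: [...] }
    if (!posts || posts.length === 0) break;
    newer.push(...posts);
    cursor = Math.max(...posts.map((p: { id: number }) => p.id)); // advance cursor
  }
  return newer;
}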


Is there a downloader that downloads posts by tags with blacklist support? RE621 only does page-by-page downloads, so it is not suitable for bulk downloads.

whowillevenreadthis said:
What is the best way to mass download from this site?

Gallery-dl is good for individual pics, authors, your favorites, or any other search result. Otherwise, use the API or recreate direct links to the static.e621 servers and get everything from there. I suspect gallery-dl does the same, but at least it can work with human-readable links too.
Edit: let gallery-dl record links to already-downloaded media in a database (sqlite3 or something). It then skips already-obtained files before even attempting the download.

whowillevenreadthis said:
I would like to download every post on this site, essentially making a local copy

You'd need to run downloads literally 24/7 from now on and for as long as possible. The site as of now has 9.3 TB of information, with images accounting for at least half of that (gifs and videos are annoyingly large). Downloading that is rather problematic unless you're a data hoarder. Doing it quickly is near impossible unless, again, you're a "copied the whole website in a day" level data hoarder.

whowillevenreadthis said:
downloading by date (the tagged date, not the upload date). For example, downloading everything made before the year 2000 in one go, so all posts with a tagged date of 1999, 1998, 1997 ...

As was said already, you can't. R34 has year tags but they're applied rather inconsistently; E621 doesn't have them at all. Your best bet is to get everything and then sort the files with a custom algorithm that checks the actual file metadata. If the date fields were not edited beforehand with something like exiftool, then it'll work. If they were edited, well... good luck then.


justkhajiit said:
You'd need to run downloads literally 24/7 from now on and for as long as possible. The site as of now has 9.3 TB of information, with images accounting for at least half of that (gifs and videos are annoyingly large). Downloading that is rather problematic unless you're a data hoarder. Doing it quickly is near impossible unless, again, you're a "copied the whole website in a day" level data hoarder.

I've managed some 800,000 posts in 24 hours, it isn't impossible
It depends on your internet speed and consistency

justkhajiit said:
As was said already, you can't. R34 has year tags but they're applied rather inconsistently; E621 doesn't have them at all.

We- do? 1899, 1987, 2020, etc

donovan_dmc said:
I've managed some 800,000 posts in 24 hours, it isn't impossible
It depends on your internet speed and consistency

We- do? 1899, 1987, 2020, etc

Not impossible, but the sheer size is rather terrifying. It wouldn't be unexpected if the ISP or e621 itself throttled the connection after some point.

Year tags just don't appear on picture pages, but good that they exist. Makes downloads easier for OP: give gallery-dl a search link and wait.

justkhajiit said:
Year tags just don't appear on picture pages, but good that they exist. Makes downloads easier for OP: give gallery-dl a search link and wait.

They are on the post; my avatar has 2022 tagged on it, and it shows up at the bottom in the meta category

I wrote a tool earlier this year that grabs every post, tag, tag alias and wiki entry and maintains a local database. It will then attempt to download the data for every post. This runs daily and updates itself. I'm currently maintaining the entire archive of the website (minus user data). Be aware the total size is around 10TB. My goal was, if worst comes to worst, to create a bare bones site people can use to access the data. I have ideas that range from far too ambitious to just a simple backup.

If people are interested, I can clean up the tool and create an open source git project. Note that it's fairly rudimentary; I rushed to create it over a weekend.

Hydrus is immense at scraping literally everything off of Booru sites it can find. It supports E6 and numerous other sites, tag searching, blacklists, and so on. It usually grabs all the metadata from the post too - such as tags, description, source, so on - save for comments, up/downvotes and other user-specific data. It's rather complicated to set up, though, so I'd look into it first. Either way, I use it myself, and I can't recommend it enough.

narrifox said:
I wrote a tool earlier this year that grabs every post, tag, tag alias and wiki entry and maintains a local database. It will then attempt to download the data for every post. This runs daily and updates itself. I'm currently maintaining the entire archive of the website (minus user data). Be aware the total size is around 10TB. My goal was, if worst comes to worst, to create a bare bones site people can use to access the data. I have ideas that range from far too ambitious to just a simple backup.

If people are interested, I can clean up the tool and create an open source git project. Note that it's fairly rudimentary; I rushed to create it over a weekend.

This sounds like possibly the best way that we the community can preserve e6's repository.

I think what'd be most necessary in the meantime is being able to selectively download by universal parameters, like filesize and upload date. With that, it'd be much easier to coordinate file preservation.
If you think it's possible, an automated networking tool that displays which parts of e6 have been backed up would be invaluable and could ensure total preservation of the site's posts.
Archivers like you give us hope for the future, thank you.

whowillevenreadthis said:
What is the best way to mass download from this site? Since HB2112 got passed, I would like to download every post on this site, essentially making a local copy, since this bill will apparently affect all users and not just the ones in Arizona. I currently use gallery-dl to download, with the --write-metadata flag to get each post's json file as well.

The way I would like to download is by date (the tagged date, not the upload date). For example, downloading everything made before the year 2000 in one go, so all posts with a tagged date of 1999, 1998, 1997 ... all the way back to the earliest tagged post (which, if I remember correctly, is a Greek pot from approximately the 6th century BC). Is it possible to order posts this way?

Is there also a way to download the notes.json for each post? I asked in a different forum post and was told I could use something like https://e621.net/notes.json?search[post_id]=5244295 for a post, but this seems like it would only work for one post at a time.

While I'm downloading posts in order of date, I would not download pools the same way. I am going to do those separately so the pools stay in order. Is there a way to show pools in order of id? If not, I'll just make a text file to do that.

I think the site as a whole is around 10-15 TB; however, even before you got close to a couple of gigabytes, I think the site would (or should) throttle the connection speed to provide optimal performance for the other users.

zeadyaer said:
I think the site as a whole is around 10-15 TB; however, even before you got close to a couple of gigabytes, I think the site would (or should) throttle the connection speed to provide optimal performance for the other users.

The previous devs (Kira, Earlopain) have each said concurrent downloads are fine; in fact, the number they recommend went up from 3 when I asked to 5 when someone else asked

Cloudflare is bearing the brunt of the traffic because of caching, so it should not be that big of an issue

donovan_dmc said:
The previous devs (Kira, Earlopain) have each said concurrent downloads are fine; in fact, the number they recommend went up from 3 when I asked to 5 when someone else asked

Cloudflare is bearing the brunt of the traffic because of caching, so it should not be that big of an issue

huh, how bout that.

rapefantasy said:
Hydrus is immense at scraping literally everything off of Booru sites it can find. It supports E6 and numerous other sites, tag searching, blacklists, and so on. It usually grabs all the metadata from the post too - such as tags, description, source, so on - save for comments, up/downvotes and other user-specific data. It's rather complicated to set up, though, so I'd look into it first. Either way, I use it myself, and I can't recommend it enough.

Seconding Hydrus; being able to search your own personal gallery by tag is a huge boon. However, on a side note, for some reason Hydrus doesn't seem to be downloading species tags for me. Have you or anyone else here run into that issue? (It's also not downloading copyright tags, but I care less about those.)

I've also used e621 downloader, which seems to work well enough, but doesn't grab tags or do any of the fancy stuff that Hydrus does.


What I settled on doing (at least for now, may change) is searching for posts by tagged year (ex: 2009), using gallery-dl's -G option to print all the URLs (the static ones), and writing them to a text file. I think it took about a minute to write 10k URLs. Then I just feed the text file to aria2 with something like aria2c -c -Z -i urls. (pls tag posts by year if they don't have one)

I also figured out that gallery-dl can get the notes metadata as well (it just takes a bit longer), so I'm doing that.

long quotes from people talking about hydrus

I will look into this; as long as it can read from json files, it should work quite well.

whowillevenreadthis said:

long quotes from people talking about hydrus

I will look into this; as long as it can read from json files, it should work quite well.

It may or may not support reading 'imports' like that, but I also haven't used it that way. Hydrus has a Discord if you need to ask some more about it. I'm mostly a pleb when it comes to it.
At any rate, what I usually do is search for a tag (e.g. searching by year, as brought up before) and then let it do its thing. It grabs the URLs automatically, downloads the file, and moves on to the next link, stopping at the end of the tag. It also automatically skips over previously downloaded files if the MD5 hash is the same. Otherwise, you can just use the built-in 'deduper' (or whatever it's officially called) to identify and manually remove dupes. Though semi-automatic deduping of videos will require an extra plugin and some setup.
All of that said, there have been some bumps in the road with Hydrus and e6 in the past, mostly having to do with having to log in through an extra plugin to see the 'good stuff'. It may or may not be fixed by now, but I'd err on the side of caution either way and just ask the aforementioned Discord if you've got any further questions.