Topic: Sources cleanup: bulk question

Posted under Tag/Wiki Projects and Questions

Hey there,

I've made a script to find e621 posts which have incorrect or messy sources (by various definitions), and it can fix some of them automatically. It's mostly various simple things, like old facdn links, sources used as comments, email addresses in source fields, and so on. It's not doing perceptual image hashing yet.

It's found about 700k posts with source links that look broken in some way, with a total of ~840k errors. It thinks it can fix about 690k of those errors automatically. (One post can have more than one error, though, which makes the automatic fixing trickier.)

That's a lot of changes, and hitting the API at the maximum recommended rate of 2 calls/sec would mean taking a few days to do it all (~690k calls at 2/sec is roughly four days of solid requests), which doesn't seem advisable?
I'm not sure of the best path to proceed here!

Here's the full report from it:

There are 3469758 posts in the dataset
700728 posts have sources matching at least one check
Total by check
- furaffinity.OldCDN: Total: 322010. Solvable: 322010 (100.00%)
- deviantart.OldFormatUserPage: Total: 172872. Solvable: 172872 (100.00%)
- furaffinity.UserLinkWithoutSubmission: Total: 98480. Solvable: 0 (0.00%)
- formatting.TitlecaseDomain: Total: 72681. Solvable: 72681 (100.00%)
- protocols.InsecureProtocol: Total: 65497. Solvable: 56919 (86.90%)
- twitter.TwitterTracking: Total: 62270. Solvable: 62270 (100.00%)
- furaffinity.DirectLinkWithoutSubmission: Total: 29263. Solvable: 0 (0.00%)
- misuse.EmailCheck: Total: 7655. Solvable: 0 (0.00%)
- protocols.MissingProtocol: Total: 3821. Solvable: 1886 (49.36%)
- furaffinity.BrokenCDN: Total: 2333. Solvable: 2333 (100.00%)
- furaffinity.CommentsLink: Total: 1882. Solvable: 1882 (100.00%)
- misuse.TextCheck: Total: 1563. Solvable: 0 (0.00%)
- formatting.SpacesInURL: Total: 955. Solvable: 955 (100.00%)
- protocols.UnknownProtocol: Total: 755. Solvable: 331 (43.84%)
- misuse.TagsCheck: Total: 478. Solvable: 0 (0.00%)
- twitter.TwitFixCheck: Total: 432. Solvable: 432 (100.00%)
- furaffinity.ThumbnailLink: Total: 375. Solvable: 0 (0.00%)
- misuse.LocalPath: Total: 284. Solvable: 0 (0.00%)
- misuse.CommaCheck: Total: 141. Solvable: 0 (0.00%)
- protocols.BrokenProtocols: Total: 13. Solvable: 13 (100.00%)
Total errors: 843760
Total solvable errors: 694584

So, the most common error by quite some margin is furaffinity.OldCDN, which actually does not necessarily need fixing. Links to d.facdn.net still work, and redirect to d.furaffinity.net (in contrast to furaffinity.BrokenCDN, which covers d2.facdn.net).

Description of each of the checks (in order of their ranking, above):

1) furaffinity.OldCDN: FA direct image links using https://d.facdn.net instead of the more up-to-date https://d.furaffinity.net
2) deviantart.OldFormatUserPage: DeviantArt links using subdomain-based URLs, https://j-fujita.deviantart.com/, instead of the modern folder-based URLs, https://deviantart.com/j-fujita/
3) furaffinity.UserLinkWithoutSubmission: Source lists which include an FA user page link, but not the link to the actual submission that matches. (Not automatically fixed atm)
4) formatting.TitlecaseDomain: Domains which are written in titlecase or sentence case, e.g. Twitter.com, or Furaffinity.net instead of twitter.com, furaffinity.net, etc. (Usually caused by mobile keyboards)
5) protocols.InsecureProtocol: Source links which use http:// instead of https://, automatically fixable on domains which we know support https
6) twitter.TwitterTracking: Twitter links which contain ?ref_src=twsrc^tfw, ?s=09, or ?lang=en tracking or extraneous information on the end
7) furaffinity.DirectLinkWithoutSubmission: Source lists which include an FA image direct link, but not the link to the submission page. (Not automatically fixed atm)
8) misuse.EmailCheck: Source lists which contain an email address (but excluding ones where the email is the only source listed)
9) protocols.MissingProtocol: Source links which are missing their protocol, fixable if we know whether the domain uses https or http
10) furaffinity.BrokenCDN: FA direct image links using the https://d2.facdn.net CDN which was used for a while during migration to https://d.furaffinity.net, this CDN domain no longer works.
11) furaffinity.CommentsLink: FA submission links, which are directed to a specific comment with a #cid:12345 style addition to the URL
12) misuse.TextCheck: Source entries that seem to be messages? Sometimes fine, but sometimes pointless? (Would need manual checking)
13) formatting.SpacesInURL: URLs which contain spaces which are improperly escaped. Seems to be seen with a few VCL links
14) protocols.UnknownProtocol: URLs with an unknown protocol, often a typo. (Automatically fixed, if a common typo)
15) misuse.TagsCheck: Some old posts seem to have their whole tag list thrown in the sources; this check tries to find those
16) twitter.TwitFixCheck: Thanks to Twitter's awful embedding in Telegram and Discord, lots of services like fxtwitter.com, vxtwitter.com, etc. have popped up to offer mirrors with better embedding. But the source links can be changed back to twitter.com
17) furaffinity.ThumbnailLink: Direct links to furaffinity thumbnails (as opposed to direct links to images, or submissions)
18) misuse.LocalPath: These seem to be posts where the source is a path to a local file, like ./image.png, C:/Users/.. or something similarly useless.
19) misuse.CommaCheck: Some old posts seem to have sources added in a comma-separated way, rather than newline-separated.
20) protocols.BrokenProtocols: Various truncated protocols, where people have put ttps instead of https and stuff.
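Most of the simple checks above boil down to string or query-string rewrites. As a rough illustration (in Ruby, since that's what e621ng uses; the function names and exact rules here are mine, not the script's actual implementation):

```ruby
require "uri"

# Hypothetical sketches of three of the simpler fixes.
TRACKING_PARAMS = %w[ref_src s lang t].freeze

# furaffinity.OldCDN: d.facdn.net still redirects, but d.furaffinity.net is current
def fix_old_cdn(url)
  url.sub(%r{\Ahttps?://d\.facdn\.net/}, "https://d.furaffinity.net/")
end

# formatting.TitlecaseDomain: lowercase the host, leave the path alone
def fix_titlecase_domain(url)
  uri = URI.parse(url)
  uri.host = uri.host.downcase if uri.host
  uri.to_s
end

# twitter.TwitterTracking: drop known tracking query parameters
def strip_twitter_tracking(url)
  uri = URI.parse(url)
  return url unless uri.host&.end_with?("twitter.com") && uri.query
  kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end
```

The path-only rewrites never need full URL parsing, which is part of why those checks are 100% solvable.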

They seem broadly categorisable by automation level:
- Very simple (559699): furaffinity.OldCDN, deviantart.OldFormatUserPage, twitter.TwitterTracking, furaffinity.BrokenCDN, furaffinity.CommentsLink, twitter.TwitFixCheck
- Trickier/semi-automated (143722): formatting.TitlecaseDomain, protocols.InsecureProtocol, protocols.MissingProtocol, formatting.SpacesInURL, protocols.UnknownProtocol, protocols.BrokenProtocols
- Needs further automation (127743): furaffinity.UserLinkWithoutSubmission, furaffinity.DirectLinkWithoutSubmission
- Manual check (10496): misuse.EmailCheck, misuse.TextCheck, misuse.TagsCheck, furaffinity.ThumbnailLink, misuse.LocalPath, misuse.CommaCheck

Questions:
1) How do I proceed here? Should I try to work through it, or should I get an admin to fix them (or at least the easy ones, e.g. swapping d.facdn.net for d.furaffinity.net)?
2) Should I ignore the furaffinity.OldCDN ones? As those links still work, they're just not the correct domain anymore.

The script is on github over here: https://github.com/Deer-Spangle/e621_source_cleanup (but it doesn't have a readme or anything, I initially just threw it together on the train this morning)

So, how should I proceed from here? I'm happy to automate what I can, and leave it running over several days, but maybe it would be better for an admin to step in and do the very simple ones at a lower level?

Oh, whoops, I've just found these two feature request threads on the same topic, but with slightly different URL checks (some overlap)

This thread: https://e621.net/forum_topics/34368
Talks about furaffinity.OldCDN and furaffinity.BrokenCDN
and suggests a few more which I've added as: twitter.OldDirectURL, inkbunny.AnchorTag, furaffinity.UploadSuccessParam, and furaffinity.FullViewLink

And this other thread: https://e621.net/forum_topics/34356
Talks about what I've called twitter.TwitterTracking
And suggests an additional check I've added as twitter.MobileLink

I noticed while running with these new checks, that someone has attempted to fix the twitter direct image links in the past, and malformed some links. A couple examples:
- https://e621.net/post_versions?search%5Bpost_id%5D=1189701
- https://e621.net/post_versions?search%5Bpost_id%5D=1868356
So I added another check, twitter.MalformedDirectLinks

(Also found some sources with two links on the same line, so added misuse.TwoURLs for that)

Adding those checks takes the totals up to this:

807364 posts have sources matching at least one check
Total by check
- furaffinity.OldCDN: Total: 321977. Solvable: 321977 (100.00%)
- deviantart.OldFormatUserPage: Total: 172842. Solvable: 172842 (100.00%)
- furaffinity.UserLinkWithoutSubmission: Total: 98480. Solvable: 0 (0.00%)
- formatting.TitlecaseDomain: Total: 72673. Solvable: 72673 (100.00%)
- protocols.InsecureProtocol: Total: 65462. Solvable: 56902 (86.92%)
- twitter.TwitterTracking: Total: 62233. Solvable: 62233 (100.00%)
- furaffinity.FullViewLink: Total: 54663. Solvable: 54663 (100.00%)
- furaffinity.DirectLinkWithoutSubmission: Total: 29263. Solvable: 0 (0.00%)
- inkbunny.AnchorTag: Total: 26887. Solvable: 26887 (100.00%)
- twitter.OldDirectURL: Total: 23296. Solvable: 23296 (100.00%)
- twitter.MobileLink: Total: 13425. Solvable: 13425 (100.00%)
- misuse.EmailCheck: Total: 7655. Solvable: 0 (0.00%)
- furaffinity.UploadSuccessParam: Total: 6786. Solvable: 6786 (100.00%)
- protocols.MissingProtocol: Total: 3821. Solvable: 1886 (49.36%)
- misuse.TwoURLs: Total: 3453. Solvable: 0 (0.00%)
- furaffinity.BrokenCDN: Total: 2333. Solvable: 2333 (100.00%)
- furaffinity.CommentsLink: Total: 1880. Solvable: 1880 (100.00%)
- misuse.TextCheck: Total: 1563. Solvable: 0 (0.00%)
- protocols.UnknownProtocol: Total: 727. Solvable: 327 (44.98%)
- misuse.TagsCheck: Total: 478. Solvable: 0 (0.00%)
- formatting.SpacesInURL: Total: 478. Solvable: 478 (100.00%)
- twitter.TwitFixCheck: Total: 432. Solvable: 432 (100.00%)
- furaffinity.ThumbnailLink: Total: 375. Solvable: 0 (0.00%)
- misuse.LocalPath: Total: 284. Solvable: 0 (0.00%)
- misuse.CommaCheck: Total: 141. Solvable: 0 (0.00%)
- protocols.BrokenProtocols: Total: 13. Solvable: 13 (100.00%)
- twitter.MalformedDirectLinks: Total: 5. Solvable: 4 (80.00%)
Total errors: 971625
Total solvable errors: 819037

The other threads suggest adding this as a feature in e621 itself, to automatically format the links when a submission is edited. There's a suggestion that this is already done for pixiv links?
I've got a bit of Ruby experience, so maybe I could have a go filing a PR, but I'm not too sure whereabouts in the repository that kind of thing is defined.
Poking around, maybe it's in here: https://github.com/zwagoth/e621ng/tree/master/app/logical/sources
If you give some pointers, I can look at making a pull request to add some extra checks covering these?

I see there's even some fixes scripts in there to do it at the database level: https://github.com/zwagoth/e621ng/blob/master/script/fixes/011.rb I could file a PR to add some of those if you want?

(Obviously, only the "Very Simple" category from above can be added as checks, but it seems like a decent start)

The main processing function for sources is at https://github.com/zwagoth/e621ng/blob/master/app/models/post.rb#L342 where it runs each entry through the alternate source functions and processes them. So any fixups or modifications to urls should be done through the alternates system, as well as the remove duplicates functions that exist within it.

The "original_url" function for alternates allows modifications of the original url that are injected back into the source list in place of the url that was given. This should be sufficient to allow normalization of URLs for various sources. Making an alternate processor for individual sites would be the easiest way to accomplish this, as the system only takes the first processor that matches a domain list.
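To make that shape concrete, here is a minimal sketch of an alternate processor. The Base stub below only imitates the behaviour described above (the real base class lives in app/logical/sources/alternates/), and the rewrite rule is illustrative, not the actual e621ng code:

```ruby
module Sources
  module Alternates
    # Stand-in for the real base class: holds the URL and lets subclasses
    # override original_url to inject a modified URL back in its place.
    class Base
      attr_reader :url

      def initialize(url)
        @url = url
        @url = original_url
      end

      # Default: leave the URL untouched.
      def original_url
        @url
      end
    end

    # Illustrative FurAffinity processor: swap the old CDN host.
    class Furaffinity < Base
      def original_url
        @url.sub("//d.facdn.net/", "//d.furaffinity.net/")
      end
    end
  end
end
```

Since the system takes the first processor matching a domain list, one such class per site keeps the normalization rules cleanly separated.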

A fixup script for this should find relevant posts and then run them through the normal save process without generating a history item. Ensuring that code runs through the normal site functions is preferred to trying to patch database data directly.
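A fixup script along those lines might look roughly like this. Everything here is a stand-in: the Post class below replaces the real ActiveRecord model so the sketch is self-contained, and the real script would suppress history generation through whatever mechanism the normal save path offers:

```ruby
# In-memory stand-in for the Post model, just to show the shape of the loop.
class Post
  ALL = []
  attr_accessor :source

  def initialize(source)
    @source = source
    ALL << self
  end

  def self.with_source_matching(fragment)
    ALL.select { |p| p.source.include?(fragment) }
  end

  # The real script would run the normal save path (validations, source
  # normalization), skipping history items, rather than patch the DB directly.
  def save!
    true
  end
end

# Find the relevant posts and re-save them with corrected sources.
def fix_old_cdn!
  Post.with_source_matching("d.facdn.net").each do |post|
    post.source = post.source.gsub("d.facdn.net", "d.furaffinity.net")
    post.save!
  end
end
```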

If there is something that you need the alternates system to do but it currently doesn't support, please discuss it in an issue so implementation can be discussed.

Ah ha, that all makes sense! Awesome. I spent a while digging through that area and I think I understand it, so I can work on adding some new source alternate classes!

I'm not sure I understand the fix scripts yet; I need to do more reading on those. Is there significance to the numbering? I notice it hops from 53 to 100; should new fixes be added in the gap, or after the hundred-odd ones?

I hadn't realised there's a limit of ten source links, too. If linking a direct image link, submission link, and gallery link, that only really allows three sites. If someone uses Twitter, FA, Weasyl and SoFurry, they've already used up their sources limit? I guess it will trim additional sources, then direct links, then galleries, etc., but it still seems a bit low, maybe. Especially when characters upload to their own galleries too.
I can see there are 501 posts that hit that limit, but I haven't yet examined their use cases.
I get the need to have a limit, but ten does seem a bit low to me, when including gallery links and everything.

For fix scripts, pick the next highest number. I skipped from 50s to 100s for the purpose of splitting up the e621 and danbooru specific fixes scripts. Most of them I didn't even check into the repo because they are irrelevant to processing things.

There is a limit on sources to keep things sane. Ideally the post was sourced from a single location and that's the one to keep, and then additional gallery links potentially fall off the bottom as needed to make room for more precise source locations.

Okay, I've started with the changes in the furaffinity source checking, mostly been poking around at how the Addressable::URI stuff works, because the docs are kind of dire.
That's in here, if you want to check I'm not making any big mistakes: https://github.com/Deer-Spangle/e621ng/blob/neaten_sources/app/logical/sources/alternates/furaffinity.rb
But I'm gonna write up the twitter stuff, and the fixes before I make a PR! Which might be towards the weekend.

I understand there being a limit on sources, but 10 just seems a rather low limit when you push for 3 links for each actual source. This might be bikeshedding, but I almost feel like 30 would be a better choice: I cannot imagine a post needing 30 source links, but I can imagine one needing 10.

Oh, if there were things that might want fixing which aren't restricted to a given domain, like converting "ttps://" to "https://", how would that be implemented? In the base or null alternate class? Or in post.rb?
Or is that the level of question which should be filed as a GitHub issue?

EDIT: Added a deviantart handler too, because it seemed like an easy one, and a big one to fix

Updated

I would probably put the ttps thing in the base class. I appreciate you taking the time to do this, and the code looks fine.

One thing though: since you are overriding original_url for the FA one, the base method which truncates the url doesn't get called anymore. It would probably be best to move that truncation into another method which is called automatically somehow, so you don't have to remember to call super when implementing original_url yourself.
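One way to get that "called automatically" behaviour is the template-method pattern: the base class owns original_url and always applies the truncation, while subclasses override a hook. This is only a sketch with made-up names (transform_url, MAX_SOURCE_LENGTH, and the 1200-character limit are all hypothetical):

```ruby
class Base
  MAX_SOURCE_LENGTH = 1200 # hypothetical limit, not the real one

  # Subclasses no longer override this, so truncation always runs last.
  def original_url(url)
    truncate(transform_url(url))
  end

  # Hook for subclasses; default is a no-op.
  def transform_url(url)
    url
  end

  private

  def truncate(url)
    url[0, MAX_SOURCE_LENGTH]
  end
end

class Furaffinity < Base
  def transform_url(url)
    url.sub("//d.facdn.net/", "//d.furaffinity.net/")
  end
end
```

With this shape, forgetting to call super is no longer possible, because subclasses never touch original_url at all.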

It's looking good but there's a few problems:

The DeviantArt parser currently errors because of an undefined constant: the class name uses CamelCase while the filename is deviantart.rb. It's fine if you change DeviantArt to Deviantart here and here.

@parsed_url is never converted to a string, so it ends up only adding the path.

----

On the FurAffinity handler the comment anchor check raises an exception when a FurAffinity URL is supplied that doesn't contain a fragment. "undefined method `start_with?' for nil:NilClass"

If none of the if statements are met, @url never gets set, and it results in FurAffinity sources just automatically getting deleted.

----

If you haven't installed the development environment I'd really recommend it.

Updated

On the topic of that, adding a few test cases to confirm that everything works as expected would be greatly appreciated. You can take inspiration from the one test file in test/unit/sources/. There once was a bit more there, but that got cleaned up.

Ruby has safe navigation: if @parsed_url.fragment&.start_with?("cid:") would get rid of that one error.
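For anyone unfamiliar, &. returns nil instead of raising NoMethodError when the receiver is nil, so the fragment check degrades gracefully. A self-contained example (using stdlib URI here rather than the Addressable::URI the handler actually uses, and a hypothetical helper name):

```ruby
require "uri"

# True only when the URL has a fragment that looks like an FA comment anchor.
def comment_link?(url)
  # fragment is nil when there's no "#..." part; &. short-circuits to nil
  URI.parse(url).fragment&.start_with?("cid:") || false
end
```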

faucet said:
It's looking good but there's a few problems:

The DeviantArt parser currently errors because of an undefined constant: the class name uses CamelCase while the filename is deviantart.rb. It's fine if you change DeviantArt to Deviantart here and here.

@parsed_url is never converted to a string, so it ends up only adding the path.

----

On the FurAffinity handler the comment anchor check raises an exception when a FurAffinity URL is supplied that doesn't contain a fragment. "undefined method `start_with?' for nil:NilClass"

If none of the if statements are met, @url never gets set, and it results in FurAffinity sources just automatically getting deleted.

----

If you haven't installed the development environment I'd really recommend it.

Thanks for having a look! I have the dev env set up, but hadn't tried my changes out there yet, as I hadn't finished adding the sources I wanted. (I've actually found it very helpful, when experimenting, to exec into the e621 docker container and use the irb there, rather than setting up a local one with the required gems!)

Hadn't realised Ruby was so restrictive about filenames and class names, huh! I'll fix that up.
Oooh, I actually noticed the nil check error today too, yeah! I fixed that up, but didn't realise Ruby has safe navigation; that should help neaten things!

I've written up all the source handlers I had planned now (any other ideas for some?), and I wrote up a fix script (number 108), I'll add these changes, and some tests and then make a PR!
Thanks for this btw, handy pointers on Ruby things, and I have been eager to contribute to e621 for a while

Hadn't realised there are tests too, I'll very happily add to those!

Added a bunch of unit tests now, but can't seem to run them on my machine, and haven't tested on my home deployment.
It's gotten very late here though, so I'll test that out tomorrow and make a PR (I guess there's no CI set up to automatically run the tests on PR though, is there?)

dr-spangle said:
Added a bunch of unit tests now, but can't seem to run them on my machine, and haven't tested on my home deployment.
It's gotten very late here though, so I'll test that out tomorrow and make a PR (I guess there's no CI set up to automatically run the tests on PR though, is there?)

You should be able to get the tests working with docker-compose run -e RAILS_ENV=test e621 bin/rails db:setup then you can run individual tests with docker-compose run tests test/functional/file_name.rb:line_number.
(or docker-compose run tests to run every test... but it's currently just a spam of failing tests)

I only tried the DeviantArt and FurAffinity tests at the moment but both seem to fail

Updated

faucet said:
You should be able to get the tests working with docker-compose run -e RAILS_ENV=test e621 bin/rails db:setup then you can run individual tests with docker-compose run tests test/functional/file_name.rb:line_number.
(or docker-compose run tests to run every test... but it's currently just a spam of failing tests)

I only tried the DeviantArt and FurAffinity tests at the moment but both seem to fail

Awesome! Okay, I'll get to tweaking that! Looks like a few more nil checks are needed, plus some import issues. (Ruby imports confuse me, to be honest. I'm used to Python imports, but Ruby's don't seem scoped the same way; they dump everything inside them into the current namespace and such?)
But yeah, cool, I can work on getting those working, thank you!

EDIT: Oh, I see, I wrote a lot of these as Sources::Strategies but they're Sources::Alternates, my bad

Updated

There we go, automated and manual tests are all looking good, and I've filed the pull request here: https://github.com/zwagoth/e621ng/pull/426

Hopefully this helps ^_^ Thanks for all the pointers! As I said, I was excited to contribute ^_^

I'll have to look into fixing some of the other issues my script picked up, via another script and the API. The furaffinity.UserLinkWithoutSubmission and furaffinity.DirectLinkWithoutSubmission errors are the ones that originally bugged me, but they can't be fixed without API calls and/or a database of perceptual image hashes from other websites, so I can look into whipping up a script that does that.
At full speed, it should take less than a day to fix all those. I'll probably slow it down a bit more than that, but it should be fine.

EDIT: I've filed an issue about the allowed number of source links also, to put the bikeshedding somewhere separate: https://github.com/zwagoth/e621ng/issues/427

Updated

Hey again!

I've been fixing the "furaffinity.UserLinkWithoutSubmission" errors with a script I threw in my check repo here: https://github.com/Deer-Spangle/e621_source_cleanup/tree/master/e621_gallery_finder
It went through finding things based on perceptual image hash, and then it presents them in a web UI, and I'm manually checking them each and approving them to update the sources.
However, I keep hitting the 70 edits per hour limit. The wiki says I can request an upgrade to Privileged level, which removes that limit, if I've proven I'm being helpful.

May I request that? I've done a few thousand edits in total, including a couple thousand of these source fixes, and I'm eager to help out more!