Hey there,
I've made a script to find e621 posts which have incorrect or messy sources (by various definitions), and it can fix some of them automatically. It's mostly simple things, like old facdn links, sources used as comments, email addresses in source fields, and so on. It's not doing perceptual image hashing yet.
It's found about 700k posts with source links that look broken in some way, and a total of about 840k errors. It thinks it can fix about 690k of those errors automatically. (One post can have more than one error though, which would make the automatic fixing trickier.)
That's a lot of changes. Even hitting the API at the maximum recommended rate of 2 calls/sec, it would take a few days to do all that, which doesn't seem like something that would be recommended?
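For the "few days" estimate, here is the rough arithmetic as a small Python sketch. It assumes one API call per fixable error, which is a simplification: a post with several errors could probably be fixed in a single call, so treat this as an upper bound. The function name is just for illustration.

```python
# Back-of-the-envelope estimate. One API call per fix is an assumption.
SOLVABLE_ERRORS = 694_584  # "Total solvable errors" figure from the report
CALLS_PER_SECOND = 2       # maximum recommended API rate

SECONDS_PER_DAY = 60 * 60 * 24


def estimated_days(edits: int, calls_per_second: float = CALLS_PER_SECOND) -> float:
    """Days needed to make `edits` API calls at a steady rate."""
    seconds = edits / calls_per_second
    return seconds / SECONDS_PER_DAY


print(f"{estimated_days(SOLVABLE_ERRORS):.1f} days")
```

At 2 calls/sec that works out to roughly four days of continuous running.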
I'm not sure of the best way to proceed here!
Here's the full report from it:
There are 3469758 posts in the dataset
700728 posts have sources matching at least one check
Total by check
- furaffinity.OldCDN: Total: 322010. Solvable: 322010 (100.00%)
- deviantart.OldFormatUserPage: Total: 172872. Solvable: 172872 (100.00%)
- furaffinity.UserLinkWithoutSubmission: Total: 98480. Solvable: 0 (0.00%)
- formatting.TitlecaseDomain: Total: 72681. Solvable: 72681 (100.00%)
- protocols.InsecureProtocol: Total: 65497. Solvable: 56919 (86.90%)
- twitter.TwitterTracking: Total: 62270. Solvable: 62270 (100.00%)
- furaffinity.DirectLinkWithoutSubmission: Total: 29263. Solvable: 0 (0.00%)
- misuse.EmailCheck: Total: 7655. Solvable: 0 (0.00%)
- protocols.MissingProtocol: Total: 3821. Solvable: 1886 (49.36%)
- furaffinity.BrokenCDN: Total: 2333. Solvable: 2333 (100.00%)
- furaffinity.CommentsLink: Total: 1882. Solvable: 1882 (100.00%)
- misuse.TextCheck: Total: 1563. Solvable: 0 (0.00%)
- formatting.SpacesInURL: Total: 955. Solvable: 955 (100.00%)
- protocols.UnknownProtocol: Total: 755. Solvable: 331 (43.84%)
- misuse.TagsCheck: Total: 478. Solvable: 0 (0.00%)
- twitter.TwitFixCheck: Total: 432. Solvable: 432 (100.00%)
- furaffinity.ThumbnailLink: Total: 375. Solvable: 0 (0.00%)
- misuse.LocalPath: Total: 284. Solvable: 0 (0.00%)
- misuse.CommaCheck: Total: 141. Solvable: 0 (0.00%)
- protocols.BrokenProtocols: Total: 13. Solvable: 13 (100.00%)
Total errors: 843760
Total solvable errors: 694584
So, the most common error by quite some margin is furaffinity.OldCDN, which does not necessarily need fixing. Links to d.facdn.net still work, and redirect to d.furaffinity.net (in contrast to furaffinity.BrokenCDN, which is d2.facdn.net).
Description of each of the checks (in order of their ranking, above):
1) furaffinity.OldCDN: FA direct image links using https://d.facdn.net instead of the more up-to-date https://d.furaffinity.net
2) deviantart.OldFormatUserPage: DeviantArt links using subdomain-based URLs like https://j-fujita.deviantart.com/ instead of the modern folder-based URLs like https://deviantart.com/j-fujita/
3) furaffinity.UserLinkWithoutSubmission: Source lists which include an FA user page link, but not the link to the actual submission that matches. (Not automatically fixed atm)
4) formatting.TitlecaseDomain: Domains which are written in titlecase or sentence case, e.g. Twitter.com, or Furaffinity.net instead of twitter.com, furaffinity.net, etc. (Usually caused by mobile keyboards)
5) protocols.InsecureProtocol: Source links which use http:// instead of https://, automatically fixable on domains which we know support https
6) twitter.TwitterTracking: Twitter links which contain ?ref_src=twsrc^tfw, ?s=09, or ?lang=en tracking or extraneous information on the end
7) furaffinity.DirectLinkWithoutSubmission: Source lists which include an FA image direct link, but not the link to the submission page. (Not automatically fixed atm)
8) misuse.EmailCheck: Source lists which contain an email address? (But excluding ones where the email is the only source listed)
9) protocols.MissingProtocol: Source links which are missing their protocol, fixable if we know whether the domain uses https or http
10) furaffinity.BrokenCDN: FA direct image links using the https://d2.facdn.net CDN, which was used for a while during migration to https://d.furaffinity.net. This CDN domain no longer works.
11) furaffinity.CommentsLink: FA submission links, which are directed to a specific comment with a #cid:12345 style addition to the URL
12) misuse.TextCheck: Source entries that seem to be messages? Sometimes fine, but sometimes pointless? (Would need manual checking)
13) formatting.SpacesInURL: URLs which contain spaces which are improperly escaped. Seems to be seen with a few VCL links
14) protocols.UnknownProtocol: URLs with an unknown protocol, often a typo. (Automatically fixed, if a common typo)
15) misuse.TagsCheck: Some old posts seem to throw their whole tags list in the sources? This check tries to find those
16) twitter.TwitFixCheck: Thanks to twitter's awful embedding in telegram and discord, lots of services like fxtwitter.com, vxtwitter.com, etc have popped up to offer a mirror with better embedding. But the source links can be changed back to twitter.com
17) furaffinity.ThumbnailLink: Direct links to furaffinity thumbnails (as opposed to direct links to images, or submissions)
18) misuse.LocalPath: These seem to be posts where the source is a path to a local file, like ./image.png, C:/Users/.. or something similarly useless.
19) misuse.CommaCheck: Some old posts seem to have sources added in a comma-separated way, rather than newline-separated.
20) protocols.BrokenProtocols: Various truncated protocols, where people have put ttps instead of https and stuff.
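To give an idea of what the "solvable" fixes look like, here are simplified Python sketches of a few of them. These are illustrations based on the descriptions above, not necessarily the exact code in the script, and the function names and the list of Twitter hostnames are assumptions made for the example.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as tracking/extraneous (from the TwitterTracking check).
TRACKING_PARAMS = {"ref_src", "s", "lang"}
# Assumed set of Twitter hostnames, for illustration only.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def fix_old_cdn(url: str) -> str:
    """furaffinity.OldCDN: swap d.facdn.net for d.furaffinity.net."""
    return re.sub(r"^https?://d\.facdn\.net/", "https://d.furaffinity.net/", url)


def fix_deviantart_user(url: str) -> str:
    """deviantart.OldFormatUserPage: move the user subdomain into the path."""
    match = re.match(r"^https?://([A-Za-z0-9-]+)\.deviantart\.com(/.*)?$", url)
    if match is None or match.group(1).lower() == "www":
        return url  # not a user subdomain, leave untouched
    path = match.group(2) or "/"
    return f"https://deviantart.com/{match.group(1)}{path}"


def lowercase_domain(url: str) -> str:
    """formatting.TitlecaseDomain: lowercase the domain, leave the path alone."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


def strip_twitter_tracking(url: str) -> str:
    """twitter.TwitterTracking: drop ref_src, s and lang query parameters."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in TWITTER_HOSTS:
        return url
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def strip_comment_anchor(url: str) -> str:
    """furaffinity.CommentsLink: remove a trailing #cid:12345 anchor."""
    return re.sub(r"#cid:\d+$", "", url)


if __name__ == "__main__":
    examples = [
        (fix_old_cdn, "https://d.facdn.net/art/someone/image.png"),
        (fix_deviantart_user, "https://j-fujita.deviantart.com/"),
        (lowercase_domain, "https://Twitter.com/SomeUser"),
        (strip_twitter_tracking, "https://twitter.com/someuser/status/123?s=09"),
        (strip_comment_anchor, "https://www.furaffinity.net/view/123/#cid:12345"),
    ]
    for fixer, source in examples:
        print(f"{fixer.__name__}: {source} -> {fixer(source)}")
```

Each fixer returns the URL unchanged when it doesn't match, so they can be chained on a post that trips more than one check.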
They seem broadly categorisable by automation level:
- Very simple (561799): furaffinity.OldCDN, deviantart.OldFormatUserPage, twitter.TwitterTracking, furaffinity.BrokenCDN, furaffinity.CommentsLink, twitter.TwitFixCheck
- Trickier/semi-automated (143722): formatting.TitlecaseDomain, protocols.InsecureProtocol, protocols.MissingProtocol, formatting.SpacesInURL, protocols.UnknownProtocol, protocols.BrokenProtocols
- Needs further automation (127743): furaffinity.UserLinkWithoutSubmission, furaffinity.DirectLinkWithoutSubmission
- Manual check (10496): misuse.EmailCheck, misuse.TextCheck, misuse.TagsCheck, furaffinity.ThumbnailLink, misuse.LocalPath, misuse.CommaCheck
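The per-level counts can be recomputed from the per-check totals in the report. A small Python sketch of that tally follows; the dictionary layout and function name are just for illustration, and the numbers are copied from the report above.

```python
# Per-check totals, copied from the "Total by check" report.
REPORT_TOTALS = {
    "furaffinity.OldCDN": 322010,
    "deviantart.OldFormatUserPage": 172872,
    "furaffinity.UserLinkWithoutSubmission": 98480,
    "formatting.TitlecaseDomain": 72681,
    "protocols.InsecureProtocol": 65497,
    "twitter.TwitterTracking": 62270,
    "furaffinity.DirectLinkWithoutSubmission": 29263,
    "misuse.EmailCheck": 7655,
    "protocols.MissingProtocol": 3821,
    "furaffinity.BrokenCDN": 2333,
    "furaffinity.CommentsLink": 1882,
    "misuse.TextCheck": 1563,
    "formatting.SpacesInURL": 955,
    "protocols.UnknownProtocol": 755,
    "misuse.TagsCheck": 478,
    "twitter.TwitFixCheck": 432,
    "furaffinity.ThumbnailLink": 375,
    "misuse.LocalPath": 284,
    "misuse.CommaCheck": 141,
    "protocols.BrokenProtocols": 13,
}

# Which checks belong to which automation level.
LEVELS = {
    "Very simple": [
        "furaffinity.OldCDN", "deviantart.OldFormatUserPage",
        "twitter.TwitterTracking", "furaffinity.BrokenCDN",
        "furaffinity.CommentsLink", "twitter.TwitFixCheck",
    ],
    "Trickier/semi-automated": [
        "formatting.TitlecaseDomain", "protocols.InsecureProtocol",
        "protocols.MissingProtocol", "formatting.SpacesInURL",
        "protocols.UnknownProtocol", "protocols.BrokenProtocols",
    ],
    "Needs further automation": [
        "furaffinity.UserLinkWithoutSubmission",
        "furaffinity.DirectLinkWithoutSubmission",
    ],
    "Manual check": [
        "misuse.EmailCheck", "misuse.TextCheck", "misuse.TagsCheck",
        "furaffinity.ThumbnailLink", "misuse.LocalPath", "misuse.CommaCheck",
    ],
}


def level_totals() -> dict[str, int]:
    """Sum the per-check totals for each automation level."""
    return {
        level: sum(REPORT_TOTALS[check] for check in checks)
        for level, checks in LEVELS.items()
    }


for level, total in level_totals().items():
    print(f"{level}: {total}")
```

Every check appears in exactly one level, so the level totals should add up to the report's total error count.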
Questions:
1) How do I proceed here? Should I try to work through it myself, or should I get an admin to fix them (or at least the easy ones, like swapping d.facdn.net for d.furaffinity.net)?
2) Should I ignore the furaffinity.OldCDN ones? As those links still work, they're just not the correct domain anymore.
The script is on GitHub over here: https://github.com/Deer-Spangle/e621_source_cleanup (but it doesn't have a readme or anything, as I initially just threw it together on the train this morning)
So, how should I proceed from here? I'm happy to automate what I can, and leave it running over several days, but maybe it would be better for an admin to step in and do the very simple ones at a lower level?