Topic: [Bug?] Database Export tool contains furry roleplay.

Posted under Site Bug Reports & Feature Requests

I was trying to create a script using the Database Export files as a reference to pull from, and my script failed far into the document at cell A428296 of the "posts-2025-01-13.csv.gz" file. The previous row ends with a description of "section=Story", followed by a multi-row break containing furry roleplay that ends with cell A428591. In all likelihood, the corresponding post's description contains a roleplay-esque story that Excel failed to process, but either way, I figured it was worth mentioning that the database export seems to have a discontinuity in it, since this seems easy to slip under the radar.

I cross-referenced this with both today's (1-13-2025) and yesterday's (1-12-2025) exports, and they both seem to contain it. These are the only two I've checked.

Database export direct link: https://e621.net/db_export/
File(s) referenced is the post export .csv

A proper to-spec CSV parser is needed to parse the posts database. What you're looking at is a post's description, which sometimes contains a story. Plenty of these will break naive CSV parsers because of the line breaks inside quoted fields.
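For anyone hitting the same wall, here is a minimal sketch of reading a gzipped CSV with Python's standard `csv` module, which treats line breaks inside quoted fields as data. The sample data, its column names (`id`, `description`), the helper name `read_gzipped_csv`, and the UTF-8 encoding are assumptions made for illustration, not details taken from the real export.

```python
import csv
import gzip
import io


def read_gzipped_csv(raw_bytes):
    """Return all rows from gzip-compressed CSV bytes (hypothetical helper)."""
    # newline="" passes line breaks through untouched, so the csv module
    # can tell a quoted line break from the end of a record.
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


# Made-up sample: the second row's description spans two lines.
sample = 'id,description\r\n1,"line one\nline two"\r\n2,"says ""hi"""\r\n'
rows = read_gzipped_csv(gzip.compress(sample.encode("utf-8")))

for row in rows:
    print(row)
```

For the real file, `gzip.open` also accepts a path such as `posts-2025-01-13.csv.gz`, and iterating over `csv.reader(handle)` row by row avoids loading everything into memory. If a very long description trips the module's default field size limit, `csv.field_size_limit` can raise it.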

kora_viridian said:

wat8548 rolled their own parser in Python, complete with its own state machine, to handle all the strange cases

I might be dumb, but isn't there some pumping lemma (and something about finite state machines) that indicates such a parser would eventually fail on some input?

kora_viridian said:

In a theoretical sense, sure, given a program and a set of inputs, can you tell if the program will eventually stop? If you could somehow prove this theorem with the e621 posts export and Python, it'd be pretty hilarious if the ACM paper that eventually wins the Nobel had to include a footnote explaining exactly what e621 is. :D

Well, that's what happened with superpermutations and 4chan, right? And I'm pretty sure the result I'm talking about is old hat.

Alright, no. The pumping lemma is a lemma (also known as a theorem) that states a particular property of regular and context-free languages (different but related), most often used to prove that something cannot be in that class.
A regular language can be parsed with a state machine and no other memory (very loosely using the word parsing, validating would be a better term), while a context free language can be parsed (again very loose) by a machine with just a state machine and a stack.

CSV, including all the edge cases used here, is fully regular. Validating a CSV file without worrying about the number of columns per row is as simple as making sure there's an even number of quotation marks. Not very helpful, but it does show the pumping lemma has nothing to do with it.
Even parsing can be done with just a state machine though: you take one input character at a time, with the state just keeping track of what column you're in and whether you're currently inside a string. When inside a string, the only way to escape the string is an unescaped quotation mark. There's no concept of nesting strings, so no stack is needed, and identifying whether a quotation mark is escaped or not just requires looking at the next character. Two quotation marks is equivalent to having seen none, while adding a single quotation mark to the output.
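A minimal sketch of the state machine described above, assuming comma separators, double-quote quoting, and LF or CRLF line endings. The state names and the function name `parse_csv` are invented for illustration. The "look at the next character" step is modelled as a third state (`QUOTE_SEEN`), so the loop really does consume one character at a time with no stack; the current column is simply `len(row)`. Carriage returns outside quotes are dropped, which is a simplification.

```python
UNQUOTED, QUOTED, QUOTE_SEEN = range(3)


def parse_csv(text):
    """Parse CSV text one character at a time; returns a list of rows."""
    rows, row, field = [], [], []
    state = UNQUOTED
    for ch in text:
        if state == QUOTED:
            if ch == '"':
                state = QUOTE_SEEN  # either a closing quote or half of ""
            else:
                field.append(ch)  # commas and line breaks are plain data here
            continue
        if state == QUOTE_SEEN:
            if ch == '"':
                field.append('"')  # "" inside a string is one literal quote
                state = QUOTED
                continue
            state = UNQUOTED  # the earlier quote closed the string
        if ch == '"':
            state = QUOTED
        elif ch == ',':
            row.append(''.join(field))
            field = []
        elif ch == '\n':
            row.append(''.join(field))
            rows.append(row)
            row, field = [], []
        elif ch != '\r':
            field.append(ch)
    if row or field:  # last record had no trailing line break
        row.append(''.join(field))
        rows.append(row)
    return rows


print(parse_csv('a,b\r\n1,"x\ny, ""z"""\r\n'))
```

The memory used for `rows`, `row`, and `field` is only output buffering; the decision logic depends on nothing but the current state and the current character.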

Telling whether a program will halt on a given input is, in the general case, proven impossible: that is the halting problem. It applies to Turing machines, which pretty much every normal programming language can simulate, so those languages suffer from the same problem. However, a simple state machine will always halt (for finite input), and while a pushdown automaton can get into an infinite loop, it's possible to write a program that will predict whether or not that will happen.

scth said:
Alright, no. The pumping lemma is a lemma (also known as a theorem) that states a particular property of regular and context-free languages (different but related), most often used to prove that something cannot be in that class.
A regular language can be parsed with a state machine and no other memory (very loosely using the word parsing, validating would be a better term), while a context free language can be parsed (again very loose) by a machine with just a state machine and a stack.

CSV, including all the edge cases used here, is fully regular. Validating a CSV file without worrying about the number of columns per row is as simple as making sure there's an even number of quotation marks. Not very helpful, but it does show the pumping lemma has nothing to do with it.
Even parsing can be done with just a state machine though: you take one input character at a time, with the state just keeping track of what column you're in and whether you're currently inside a string. When inside a string, the only way to escape the string is an unescaped quotation mark. There's no concept of nesting strings, so no stack is needed, and identifying whether a quotation mark is escaped or not just requires looking at the next character. Two quotation marks is equivalent to having seen none, while adding a single quotation mark to the output.

Telling whether a program will halt on a given input is, in the general case, proven impossible: that is the halting problem. It applies to Turing machines, which pretty much every normal programming language can simulate, so those languages suffer from the same problem. However, a simple state machine will always halt (for finite input), and while a pushdown automaton can get into an infinite loop, it's possible to write a program that will predict whether or not that will happen.

Thanks for the info!
So that means csv can be parsed by regex

snpthecat said:
Thanks for the info!
So that means csv can be parsed by regex

It can be validated by regex - you can write regex that determines whether something is a valid CSV file pretty easily, as long as you don't care that each row has the same number of columns. Actual parsing is trickier, though with repeated application (find first match, grab the capture, repeat) doable.
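A rough sketch of both ideas with Python's `re` module, assuming comma separators, double-quote quoting, and LF or CRLF line endings. The pattern and function names are made up for illustration. The validator is stricter than just counting quotation marks: it also rejects a stray quote in the middle of an unquoted field.

```python
import re

# One field: either a quoted string (with "" as an escaped quote) or plain text.
FIELD = r'(?:"(?:[^"]|"")*"|[^",\r\n]*)'
RECORD = rf'{FIELD}(?:,{FIELD})*'
CSV_RE = re.compile(rf'{RECORD}(?:\r?\n{RECORD})*')

# Same field shape, but with captures, followed by its separator.
TOKEN = re.compile(r'(?:"((?:[^"]|"")*)"|([^",\r\n]*))(,|\r?\n|\Z)')


def is_valid_csv(text):
    """Validate the whole text with a single regex (column counts ignored)."""
    return CSV_RE.fullmatch(text) is not None


def parse_with_regex(text):
    """Parse by repeated application: match one field, keep it, move on."""
    rows, row, pos = [], [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"malformed CSV near offset {pos}")
        quoted, plain, separator = match.groups()
        row.append(quoted.replace('""', '"') if quoted is not None else plain)
        pos = match.end()
        if separator != ',':  # line break or end of text closes the record
            rows.append(row)
            row = []
    if row:  # text ended right after a comma: one last empty field
        row.append('')
        rows.append(row)
    return rows


sample = 'a,b\n1,"x\ny ""z"""\n'
print(is_valid_csv(sample))
print(parse_with_regex(sample))
```

The loop around `TOKEN.match` is doing the bookkeeping the regex cannot: which row each captured field belongs to.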