adrift on a cosmic ocean

Writings on various topics (mostly technical) from Oliver Hookins and Angela Collins. We have lived in Berlin since 2009, have two kids, and have far too little time to really justify having a blog.

MySQL replication and the idempotence myth

Posted by Oliver on the 11th of January, 2011 in category Tech
Tagged with: fail, mysql, replication

I had a fairly frustrating time today trying to explain to someone that the order of actions matters very much when recovering failed replication on a MySQL server. We had a good dump from the master taken with --single-transaction --master-data=2 (yes, all InnoDB tables here), and had imported it successfully. However, at the critical moment of reading the binary log name and position from the comments at the start of the dump file, the operator in charge mistakenly entered an old CHANGE MASTER statement from the command history and started replication at an earlier position than the correct one.
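For reference, the correct coordinates were sitting right at the top of the dump file; --master-data=2 writes them there as a comment. A sketch of the procedure, with hypothetical file names and positions:

$ grep -m 1 'CHANGE MASTER' dump.sql
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000142', MASTER_LOG_POS=10823541;

After importing the dump on the slave, those exact values (and only those) belong in the real statement:

CHANGE MASTER TO
    MASTER_LOG_FILE='mysql-bin.000142',
    MASTER_LOG_POS=10823541;
START SLAVE;

Feed it stale coordinates from the history instead, and the slave replays binary log events whose effects are already contained in the dump.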

The completely expected happened, and a short while later replication failed again on a duplicate insertion error. At this point there was some disagreement between myself and the operators nursing the system back to health. Their position was that we could simply restart replication from the correct point and let it continue safely from there. My position was that once you have restarted replication from an incorrect point in the binary logs, you can no longer be assured that your data is consistent with what is on the master.
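The failure itself shows up in SHOW SLAVE STATUS, and the tempting console fix is a one-liner. A sketch, with a hypothetical error:

mysql> SHOW SLAVE STATUS\G
...
   Last_Errno: 1062
   Last_Error: Error 'Duplicate entry '4711' for key 'PRIMARY'' on query 'INSERT INTO foo ...'
...
mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
mysql> START SLAVE;

The trouble is that a duplicate insertion is only the loudest symptom. Every event between the wrong starting position and the correct one gets applied a second time, and most of those events are not polite enough to raise an error.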

Consider if you will a brief timeline of events:

Timeline 1                        Timeline 2

insert into foo values 1,2        delete from foo where bar > 2
insert into foo values 3,4        insert into foo values 1,2
delete from foo where bar > 2     insert into foo values 3,4

OK, so that's fairly nasty pseudo-SQL, but you get the idea. Regardless of which (contrived) timeline you judge to be the "correct" one, reordering the statements leads to a different set of data at the end. In Timeline 1 you will have only values 1 and 2 in your table, whereas in Timeline 2 you will have 1 through to 4. Relying on being able to skip past duplicate insertion errors will not save you from the inevitable updates or deletes, which can also change the dataset in non-idempotent ways without raising any error at all.
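You can demonstrate this with real SQL in about a minute. A minimal sketch using a throwaway table:

CREATE TABLE foo (bar INT);

-- Timeline 1: inserts first, delete last
INSERT INTO foo VALUES (1), (2);
INSERT INTO foo VALUES (3), (4);
DELETE FROM foo WHERE bar > 2;
SELECT bar FROM foo;   -- 1, 2

-- Timeline 2: the same three statements, delete replayed first
TRUNCATE TABLE foo;    -- reset for the second run
DELETE FROM foo WHERE bar > 2;
INSERT INTO foo VALUES (1), (2);
INSERT INTO foo VALUES (3), (4);
SELECT bar FROM foo;   -- 1, 2, 3, 4

Note that neither run raises a single error. The DELETE succeeds whether or not it matches any rows, so a slave that diverged this way would carry on replicating as if nothing were wrong.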

"But that's a completely contrived example" you might very well say. "It's unlikely something like that would happen on the real dataset" may be another phrase used. My answer to that is - as a sysadmin do you know 100% how the application works? Do you know every SQL query that will occur?

Consider Marty going back to 1955 in Back to the Future. The past is changed, and he has to spend a bunch of effort fixing things up so that he is eventually born, and still alive at least until 1985. He gets back to 1985 and a lot of things are not as he remembers. For 1985 to be just as he left it, he would have to ensure that every event from 1955 to 1985 happened exactly as it did in his own timeline. Similarly, to have any surety that your dataset will end up even "similar" to how it should be, you would have to trace through the entire binary log, fixing problems as you go (and even then the outcome would be doubtful).

There are far better uses of our time than making mistakes and pretending they don't exist.
