
So I woke up late today, around 1300, to find that my Nextcloud instance was down.  I'm hosting it on Debian Bullseye, installed from the regular old tarball and set up manually with Apache, MariaDB/MySQL, PHP, etc.  It's been running great for literally years across multiple in-place upgrades to both Nextcloud and Debian.

 

After doing some tinkering it came to my attention that Nextcloud was complaining it couldn't connect to the database.  Easy enough, I figured; I'll just log into MySQL and see what's wrong.  Upon trying to launch the MySQL shell, though, it would ask for the password and then error out saying it couldn't connect to the server.

"ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (111)"

So I thought maybe the .sock file got messed with during an update or something and wasn't being removed properly.  I verified the location of the correct file by looking at the configs (all of which pointed to the same file), deleted that mysqld.sock file, and tried restarting MySQL, but still no dice.  I tried rebooting the whole server just for kicks; no luck.
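For anyone hitting the same error, the checks amounted to something like this (service name and config locations as on a stock Debian install; yours may differ):

    # confirm where the configs expect the socket to live
    grep -r socket /etc/mysql/

    # clear out the (possibly stale) socket and restart the service
    rm -f /run/mysqld/mysqld.sock
    systemctl restart mariadb

    # if it still won't start, see why
    journalctl -u mariadb --no-pager | tail -n 50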

 

I tried reinstalling MariaDB/MySQL, but that apparently doesn't get rid of the existing configuration files, so what I ended up doing was apt purge --autoremove on mariadb-server, deleting /var/run/mysql, then reinstalling it and re-importing my most recent database backup (from yesterday).  It's just a personal instance with myself, my wife and kids on it, and I've got it scheduled to do daily backups of the database, so it wasn't a huge issue.  What I'm curious about is why it crapped out in the first place.
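Roughly that sequence, for anyone searching later (the dump filename here is just a placeholder, and /var/run/mysql is the path on my box):

    apt purge --autoremove mariadb-server
    rm -rf /var/run/mysql
    apt install mariadb-server

    # re-create the nextcloud database and user if your dump doesn't include them,
    # then pull in the most recent backup
    mysql -u root -p nextcloud < nextcloud-backup.sql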

 

While poking around in syslog I found the following line:

mariadbd[1115]: 2022-01-01 11:55:02 0 [ERROR] [FATAL] InnoDB: You should dump + drop + reimport the table to fix the corruption.


That timestamp is hours after any kind of automatic update/reboot would have taken place.

 

So something crazy happened that corrupted the actual database, but why would that have broken my ability to log into the MySQL shell to try and correct it?  It's saying I should dump and reimport the database, but I couldn't do that without having access to the MySQL shell.
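(For reference, going by the MariaDB documentation, the escape hatch for that chicken-and-egg problem appears to be starting the server with innodb_force_recovery set just high enough that it comes up, taking the dump, and then removing the setting; something like this:)

    # in the [mysqld] section of /etc/mysql/mariadb.conf.d/50-server.cnf (path may vary)
    innodb_force_recovery = 1    # raise gradually, up to 6, only as far as needed to start

    systemctl restart mariadb
    mysqldump --all-databases > rescue-dump.sql
    # then drop and reimport the affected table with the setting removed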

 

I've checked the logs for apt and I don't see any kind of updates that would have been applied by unattended-upgrades; my last automatic update was December 18th.
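(The usual places to look on Debian, for anyone following along:)

    less /var/log/apt/history.log                               # every package install/remove/upgrade, with timestamps
    less /var/log/unattended-upgrades/unattended-upgrades.log   # what unattended-upgrades decided to do, and when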

 

Did anybody else have anything happen today with their database?  I guess it's definitely possible that Nextcloud encountered some kind of bug and corrupted its database.  I've done short SMART tests on all the drives in the system and found no issues, and the server is running on a UPS, so there shouldn't have been any kind of power fluctuation or outage to cause problems.  My UPS is reporting no events since the 17th either.
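(The SMART tests were nothing fancy; roughly this per drive, with the device names adjusted:)

    smartctl -t short /dev/sda    # kick off a short self-test (takes a couple of minutes)
    smartctl -H /dev/sda          # overall health verdict once it's finished
    smartctl -a /dev/sda          # full attribute table and error log for the details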

 

I guess I'm posting all this just to try and fish for thoughts from any of you who may have encountered this kind of thing in the past, or who may have some idea as to what happened.  I've restored a backup and everything is fine, but if there's something I can do to prevent the issue in the future, I'd like to do so.


Well, I have a bunch of MySQL / MariaDB 5.x and MariaDB 10.x instances which are all running without issue right now.

 

I've had things like that happen before, though. One cause is the filesystem temporarily running out of disk space, which can leave a table needing repair. I also have a suspicion that MySQL doesn't behave well if the data filesystem is briefly marked read-only, but that's just a hunch.

 

Table corruption can stop MySQL from starting though. That's a thing unfortunately.

 

Personally I'd recommend enabling the binlog and adding "--master-data=2" to your mysqldump line so that you can recover the database right up to the point where the corruption occurred. If you back up both the database dump file and the associated binlog files, then you're pretty well sorted in terms of data recovery, I think.
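Something along these lines, using the stock Debian paths (adjust to taste):

    # in the [mysqld] section of /etc/mysql/mariadb.conf.d/50-server.cnf
    log_bin          = /var/log/mysql/mysql-bin.log
    expire_logs_days = 10

    # nightly dump; --master-data=2 records the binlog file and position as a comment inside the dump
    mysqldump --single-transaction --master-data=2 --all-databases > nightly-backup.sql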

I just checked my install of MySQL running on Raspbian and all is well. With you having had to do a complete wipe and restore, the entries showing the last things that were modified or added/removed are more than likely gone. The last time I had any corruption on my setup it was from testing new additions, and it was completely my own doing. Have you checked any connection logging to see whether any weird connections showed up after the last time you knew it was working?

 

@DonC has a great point, as that missing data since the last backup could be vital to seeing what happened.

The log entry immediately prior to the error messages is Nextcloud invoking its cron.php script, so I'm guessing it has something to do with that.  I've made copies of syslog from that timeframe, so I may dig into it some more later, but I'm tired of reading logs; since everything is back up and working I'll save it for later.

Good luck, and I am happy that at least everything is back up and going. Keep us posted if you do dig into this; I am interested to see what you find if you do.

On 01/01/2022 at 21:05, Gerowen said:

 

Did anybody else have anything happen today with their database? 

 

chinese hackers

On 01/01/2022 at 22:44, Marujan said:

chinese hackers

I thought about hackers of some sort, but there were no indications of any files missing or modified, no suspicious entries in auth.log, nothing banned by Fail2Ban, etc.  On top of that, the various services hosted by the server each run under their own non-root user accounts/groups, and SSH is not open to the world and enforces public key authentication.  I'm fairly certain it was just some weirdness with the database during the execution of Nextcloud's cron script.
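(The SSH side is just the usual key-only settings in /etc/ssh/sshd_config, nothing exotic:)

    PasswordAuthentication no
    PubkeyAuthentication yes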

 

Besides, with only 4 users, outside of some script kiddie who happened across a public share link I've posted somewhere, there's not really any incentive to try and bother my personal server.


So here's a piece of syslog.  You can see that at 11:50 the cron.php script executes and there are no problems.  5 minutes later it runs again (this is scheduled/expected), and this is where the problems begin.  So in the block of time between 11:50 and 11:55, something screwy happened.  I was asleep at the time, so I personally wasn't doing anything on the server directly, but we've all got cell phones and PCs connected to it all the time, plus I've shared several public links for photo albums and such with family members over Facebook, so even if I wasn't logged in, Nextcloud is constantly doing "something" in the background.
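(The 5-minute schedule is just the standard Nextcloud crontab entry for www-data; the install path here is assumed:)

    # crontab -u www-data -l
    */5  *  *  *  *    php -f /var/www/nextcloud/cron.php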

[Screenshot: syslog excerpt covering the 11:50 and 11:55 cron.php runs]

 

Here's the contents of auth.log for that particular block of time.  Nothing suspicious, root running cron and www-data running Nextcloud's cron.php script.

[Screenshot: auth.log excerpt for the same time window]

 

The database and the Nextcloud server files are stored on the main system drive, which is a Western Digital Blue 2.5" SSD.  The actual data directory (user files), however, is stored on a separate, encrypted RAID 5 "storage" partition.  Both drives have plenty of free space available.

[Screenshots: disk usage output showing plenty of free space on both drives]

 

I never thought to keep a copy of the corrupted database for further inspection; once I got the backup up and running I deleted it.  I even had a copy of /var/run/mysql saved in the event that purging/re-installing MariaDB didn't fix the issue, but once it was clear everything was working again I deleted that too.  As far as I can tell, everything looks fine.  All I can figure is that I encountered some kind of weird bug/edge case.

I am running an older system.  The "server" originally started out as an old HP Pavilion P6803W tower PC that I bought ages ago.  Since then it has received an upgrade to a 6-core AMD Phenom II processor, 16GB of RAM, a new power supply, new case, etc.  The only original part is the motherboard.  However, all of the hardware in it is old and used, and the RAM isn't ECC, so it's totally possible that there was some sort of bit flip or other hardware issue.  I haven't had any issues in the past, but that doesn't mean they can't start, especially since the system has been running basically 24/7/365 for going on a decade now.  The temps have always been in great shape because I put an over-sized 125 watt cooler on a 95 watt chip.

[Screenshot: temperature readings]

 

There are no indications of this being any kind of attack either: no changes to my firewall rules, no new packages installed or removed, no modifications to any of my systemd service files, no files apparently tampered with, nobody banned by Fail2Ban, no unexpected auth attempts or blocked traffic on my firewall, and no weird entries in syslog/kern.log (at least that I've noticed).
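(Those checks were roughly of this shape; jail names and firewall tooling will vary by setup:)

    fail2ban-client status sshd                       # recent and total bans for the SSH jail
    iptables -L -n -v                                 # current firewall rules and hit counters
    grep -E ' install | remove ' /var/log/dpkg.log    # any packages added or removed recently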

 

On the hardware front, all the drives check out after running some short SMART tests, but I will see about doing a memtest scan at some point just to verify whether there are any issues with the RAM.  I'm gonna hope that it was just a software bug and I don't encounter it again, because even though I don't mind replacing the server, I'm kind of attached to the old girl. :p  I will also verify that I don't have any other services hogging up my RAM, just to be safe.
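(If I don't want to take the box offline for memtest86+, a quick in-place check with memtester plus a look at current usage seems like a reasonable first pass; memtester needs enough free RAM to lock:)

    free -h                  # current RAM and swap usage at a glance
    apt install memtester
    memtester 2048M 1        # lock and test 2 GiB of free RAM for one pass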


At least from the quick views, nothing looks out of place. Given the hardware is as old as it is, it could just be a really unfortunately timed hiccup. If the drives check out and no bad sectors are found, my next check would be the RAM.

 

Keep up on those backups to be safe, and I hope it does not happen again. 🤞
