Lessons Learned – Minecraft, Ansible, Crashes, Debugging

Quick Link – GitHub Project Repo

Introduction

Valhelsia Enhanced Vanilla maintenance update 1.1.2 has been released with a number of fixes and changes. We’ve been experiencing frequent crashes on the servers and stalled restarts, so some attention was needed to get things working properly again. The first thing I did was take snapshots of the Minecraft servers in VMware vCenter, so if anything went wrong, I could always revert to a stable state. I also made sure to have stable backups of the world files generated and stored on the remote file server, so I could recreate the server without any data loss.

Ansible Updates

The Ansible scripts to install/upgrade Minecraft (either Vanilla or Valhelsia) have been updated to back up and restore the whitelist, op, and other json files. The other changes are based on the lessons below: the backup script now backs up more than the world folder, the Minecraft service auto-restarts, and a crash no longer waits for user input before terminating. I also removed the Chunky Pregenerator step, as this won’t be desired on an upgrade and can be initiated manually.

You can find the latest copy of the Ansible scripts on my GitHub page here.

Lesson – File Server – Point of Failure

One of the recent outages on the Minecraft servers was due to the virtual machines running out of disk space. Minecraft acts VERY weird when this happens. Users could log in, but the server would frequently crash, inventory would get lost or duplicated, and progress would not be saved.

On our Minecraft servers, an automated script backs up the Minecraft files (specifically the world files) every 4 hours to a compressed tar.gz file, then moves it to the backup folder, which is a mounted CIFS share running on TrueNAS SCALE. The file server stores about 2 weeks of rolling backups, about 1TB in size. The problem occurred because the network file share had disconnected, likely due to a software upgrade and restart of the file server. With the share gone, the backups started piling up on the Minecraft VM itself, eventually taking up 100% of the space. Once this happened, the server became unstable.
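One way to stop backups from piling up locally is to have the backup step check the mount before it writes anything. The following is a minimal Python sketch of that idea, not the script from the repo. The function name, its arguments, and the choice to skip the run entirely when the share is missing are all assumptions made for illustration.

```python
import os
import shutil
import tarfile
from pathlib import Path
from typing import Callable, Optional


def back_up_world(
    world_dir: str,
    staging_dir: str,
    share_mount: str,
    archive_name: str,
    is_mounted: Callable[[str], bool] = os.path.ismount,
) -> Optional[Path]:
    """Archive world_dir as <archive_name>.tar.gz and move it to the share.

    Returns the archive's final path, or None when share_mount is not an
    active mount point. In that case nothing is written to the local disk.
    All names and paths here are hypothetical.
    """
    if not is_mounted(share_mount):
        # The share has dropped, so share_mount is just a local folder.
        # Skip this run instead of filling the VM's own disk.
        return None

    archive = Path(staging_dir) / f"{archive_name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(world_dir, arcname=Path(world_dir).name)

    destination = Path(share_mount) / archive.name
    shutil.move(str(archive), str(destination))
    return destination
```

The is_mounted parameter exists only so the check can be swapped out when testing. In normal use the default, os.path.ismount, reports whether the backup folder is really a mounted filesystem.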

I was able to manually fix this and restore the last uncorrupted world backup, but it’s happened a few times, so further action is needed. I’d like to set up some kind of alerting to notify me if the file share mount disconnects, the VM drive space gets too low, or the server crashes too many times in an hour. That will be a new project of its own. Suggestions are welcome!
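As a starting point for that alerting, here is a rough Python sketch of the three checks. Nothing like this exists in the repo yet. The thresholds, function names, and the idea of counting recent files in the crash-reports folder are assumptions, and how the alerts get delivered (email, Discord webhook, etc.) is deliberately left out.

```python
import os
import shutil
import time
from pathlib import Path
from typing import List, Optional


def disk_used_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def recent_crash_count(crash_dir: str, window_seconds: int = 3600,
                       now: Optional[float] = None) -> int:
    """Count files in crash_dir modified within the last window_seconds."""
    now = time.time() if now is None else now
    folder = Path(crash_dir)
    if not folder.is_dir():
        return 0
    return sum(
        1 for entry in folder.iterdir()
        if entry.is_file() and now - entry.stat().st_mtime <= window_seconds
    )


def collect_alerts(share_mount: str, data_path: str, crash_dir: str,
                   max_used_percent: float = 90.0,
                   max_crashes_per_hour: int = 3) -> List[str]:
    """Return a human-readable message for each check that fails.

    The default thresholds are placeholders, not tuned values.
    """
    alerts = []
    if not os.path.ismount(share_mount):
        alerts.append(f"backup share not mounted at {share_mount}")
    used = disk_used_percent(data_path)
    if used >= max_used_percent:
        alerts.append(f"disk {used:.0f}% full at {data_path}")
    crashes = recent_crash_count(crash_dir)
    if crashes > max_crashes_per_hour:
        alerts.append(f"{crashes} crash reports in the last hour")
    return alerts
```

Something like this could run from cron or a systemd timer every few minutes, with any non-empty result handed to whatever notifier ends up being used.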

Lesson – Minecraft Valhelsia – “Press Enter To Continue”

One of the issues with the Valhelsia Minecraft servers is frequent crashes. An occasional crash wouldn’t be a problem, but the modded servers did not seem to recover automatically. I would have to either manually restart the service, or log into the tmux session and press “Enter” to have the script terminate and restart.

After dealing with this for a while, where a crash could mean hours of downtime until I manually restarted things, I investigated the issue further. The first thing I did was ensure the service had the proper settings to restart on failure. In the minecraft.service file, I made sure this was added:

Restart=on-failure
RestartSec=5s

This would ensure that if the service crashed or otherwise exited with an error, it would automatically restart. The other problem was that whenever the service did crash, it would wait for user input to “Press enter to continue” before it would restart. Eventually I asked on the Valhelsia Discord, and one of the admins mentioned that the prompt came from logic in their startup script, which my service was calling. I was able to find this line and remove it manually, and I also automated removing the line via my Ansible scripts.
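The line removal itself is simple to sketch. The exact line in the Valhelsia startup script is not reproduced in this post, so the Python below just drops any line containing the prompt text. The function name and the marker string are assumptions for illustration, not the contents of the real script or of my Ansible task.

```python
def remove_prompt_lines(script_text: str,
                        marker: str = "press enter to continue") -> str:
    """Return script_text without any line containing marker.

    Matching is case-insensitive. The default marker is a guess at the
    prompt text, not a line copied from the real startup script.
    """
    kept = [
        line for line in script_text.splitlines(keepends=True)
        if marker.lower() not in line.lower()
    ]
    return "".join(kept)
```

In a playbook, the lineinfile module with state: absent and a regexp does the same kind of job. Either way, check the result by hand once: if the prompt is part of a multi-line construct in the script, deleting a single line could leave it broken.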

Lesson – Minecraft Valhelsia – Debugging Crashes

There are a lot of different mods in the Valhelsia Enhanced Vanilla Minecraft Pack. Thankfully the team that maintains the pack puts out frequent updates, updating the mods to newer versions, adjusting balance, and removing items, creatures, and other mechanics that cause problems. The changelog showing these adjustments is available here.

Some occasional crashes are expected with this many mods, but one particular issue that had to be addressed was when one of the Minecraft servers went into a crash loop. It would automatically start up, crash, terminate, then repeat. When Minecraft crashes, it saves a crash report file to the crash-reports folder for later analysis, and the logs are also available to help work out what happened. From the stack trace, I was able to tell that the crash was occurring in AttributeFix, which is a Darkhax mod.
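Reading the stack trace by eye worked here, but the same idea can be roughed out in Python: walk the frames from the top and return the first one that does not belong to Java, Minecraft, or the mod loader. This is a heuristic sketch only. The ignored prefixes are my own guess at a sensible list, and the first non-core frame is a hint about where to look, not proof of which mod is at fault.

```python
import re
from typing import Optional, Tuple

# Matches Java stack frames such as "at some.package.Class.method(File.java:12)",
# skipping any "module/" style prefixes in front of the class name.
FRAME = re.compile(r"^\s*at\s+(?:[^\s/()]+/)*([\w.$<>]+)\(", re.MULTILINE)

# Packages treated as "not a mod". This list is an assumption, adjust as needed.
IGNORED_PREFIXES = ("java.", "javax.", "jdk.", "sun.", "com.mojang.",
                    "net.minecraft.", "net.minecraftforge.")


def first_suspect_frame(report_text: str,
                        ignored: Tuple[str, ...] = IGNORED_PREFIXES) -> Optional[str]:
    """Return the first frame outside the ignored packages, or None."""
    for match in FRAME.finditer(report_text):
        frame = match.group(1)
        if not frame.startswith(ignored):
            return frame
    return None
```

Feeding it the text of a file from the crash-reports folder returns a fully qualified method name, and the package at the front of that name usually points at the mod to test first.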

I tested by pulling out the mod (deleting the .jar from the mods folder) and restarting the server, and it did not crash. I then tested downloading the latest version of the mod and dropping it in, and that also worked. Upgrading from AttributeFix-Fabric-1.18.1-13.0.5.jar to AttributeFix-Forge-1.18.2-14.0.1.jar resolved the issue, as did upgrading the entire server to Valhelsia Enhanced Vanilla 1.1.2.

Lesson – Minecraft Valhelsia – Know what your app is doing

While coding the changes to the Ansible scripts, I ran into a weird issue. I was backing up the files that hold the server whitelist, op user list, server.properties, etc., and then restoring them after the server update. However, when I checked the files after Ansible had finished running, they were missing all of the user data. Thankfully I had backups/snapshots, so I was able to restore these quickly, but I was very puzzled why my pre-install and post-install tasks weren’t working as I expected.

First, I confirmed files were getting copied to the /tmp directory properly while Ansible ran. I was able to check this directly. So the problem had to be on the post-installation side in the postinstall.yml file.

To debug what was going on, I ran the Ansible playbook with the -v option to get more verbose output during execution. This didn’t give me what I needed, so I started adding pauses using the Ansible Pause Module so I could inspect the files at each step and see when they were changing.
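During each pause I compared the files by hand. A small helper could do the comparison instead, by hashing the backed-up copy and the live copy and listing any that differ. This is a sketch of that idea rather than something from the playbook, and the function names and directory arguments are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(backup_dir: str, live_dir: str,
                  names: Iterable[str]) -> List[str]:
    """Names whose live copy is missing or differs from the backup copy."""
    changed = []
    for name in names:
        backup = Path(backup_dir) / name
        live = Path(live_dir) / name
        if not live.is_file() or sha256_of(backup) != sha256_of(live):
            changed.append(name)
    return changed
```

Running it at each pause against the copies in /tmp and the server folder would show the step after which a file such as whitelist.json stops matching its backup.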

Eventually I figured out the problem. I was replacing, for example, the whitelist.json file while the Minecraft service was running. The file itself was replaced correctly, but you need to reload Minecraft through the console to force it to re-read from the files. Otherwise, the next time you whitelist or op a user, the server clobbers the file because it doesn’t know about the changes. I adjusted the commands I was sending into the Minecraft console to reload (and re-read from the files) before making other changes, and this addressed the issue.
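The ordering is the important part: put the file in place first, then tell the running server to re-read it, and only then send any other whitelist commands. Here is a minimal Python sketch of that sequence for the whitelist only. The tmux session name "minecraft" and the function names are assumptions, and this is not the task list from my playbook.

```python
import shutil
import subprocess
from typing import Callable, List


def console_command(session: str, command: str) -> List[str]:
    """Build the tmux call that types `command` into the server console."""
    return ["tmux", "send-keys", "-t", session, command, "Enter"]


def _run_checked(argv: List[str]) -> None:
    subprocess.run(argv, check=True)


def restore_whitelist(backup_file: str, live_file: str,
                      session: str = "minecraft",
                      run: Callable[[List[str]], None] = _run_checked) -> None:
    """Copy the backed-up whitelist into place, then make the server re-read it.

    The session name is hypothetical. `run` can be replaced for testing.
    """
    shutil.copyfile(backup_file, live_file)
    # Without this reload the server keeps its in-memory list and overwrites
    # the restored file the next time the whitelist changes.
    run(console_command(session, "whitelist reload"))
```

Only the whitelist is covered here because whitelist reload is the console command I can point to with confidence. The other restored files need their own check that the server has actually picked them up.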

Definitely one of those cases where you need to know how your application behaves!
