Rails is a really cool framework to work with, but it is not foolproof, and it will not prevent you from doing stupid things. That said, even with the best tools available, putting new software into production, or making a significant upgrade to software that is already in production, is always a high-adrenaline operation.
I can bet you’ve already run into at least one of these problems:
- QA and production run different OS versions, and software you have tested thoroughly will not install in production
- The production database has much more data than your test database, causing performance problems
- QA and production, for financial or other reasons, use a different number of machines for different services
We’ll talk about each of these problems, along with some ways of identifying their side effects, fixing them, or working around them.
QA and production run different OS versions, and software you have tested thoroughly will not install in production
Once upon a time, there was a system in QA: a major upgrade to a system that was already in production. As part of it, many libraries were upgraded, Rails went from 4.x to 5.x, and many other upgrades were made. Everything was working fine: engineers tested the system, select users tested the system, the company CEO tested the system. There was no chance of having problems during the deploy to production.
Except that all the engineers forgot to check whether the QA server was running the same Linux version as the production servers. This caused lots of different problems, starting with Sidekiq not being able to use the Redis version available on the Linux distribution installed on the production server.
To prevent this problem, simply verify the operating system version in every environment. It is best to use the same version, at least on the QA and production servers. The only exception to that rule is when you are planning to upgrade the production server’s OS; in that case it is better to use the QA server to test the upgrade first.
As a workaround, the incompatible software can be compiled from source; installing a compatible version from source is usually enough. Never copy a binary from one server to another, because library differences can cause lots of unexpected problems.
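To make the version check painless, you can script it. Here is a minimal sketch of a helper that reports the OS release of the machine it runs on (it assumes a Linux box with `/etc/os-release`, falling back to `RUBY_PLATFORM` elsewhere); running it on each environment and diffing the output is enough to catch a mismatch before deploy day:

```ruby
# Report the OS release so QA and production can be compared before a deploy.
# Assumes a Linux /etc/os-release file; falls back to RUBY_PLATFORM otherwise.
def os_release(path = "/etc/os-release")
  if File.exist?(path)
    # /etc/os-release is KEY=VALUE lines; PRETTY_NAME holds the readable version
    line = File.readlines(path).find { |l| l.start_with?("PRETTY_NAME=") }
    line ? line.split("=", 2).last.strip.delete('"') : "unknown"
  else
    RUBY_PLATFORM # non-Linux fallback
  end
end

puts os_release
```

Running this on QA and production (via SSH, a Capistrano task, or by hand) and comparing the two strings is a thirty-second check that would have caught the problem above.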
The production database has much more data than your test database, causing performance problems
This problem is really hard to identify in QA. It usually happens in systems that have some kind of reporting interface, or sometimes in the rendering of an edit interface.
I’ve seen this problem, for example, in a system’s user management area: in the user list screen, which had no server-side pagination, and in the user profile editor.
The user list had a problem because QA had far fewer users (around 100 in QA versus 60k in production). This difference made the user listing freeze the screen, since no browser can handle the workload of adding 60k rows to the DOM at once.
The user profile editor had a similar problem, because the properties being edited were loaded from the database, and some users in production had significantly more properties than any user tested in QA.
The only solution for this problem is to test with data as close to production as possible.
As a workaround, you’ll need to identify what is causing the slowness of the application: screen rendering or database time.
For screen rendering, the easiest solution is server-side pagination and similar techniques.
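The core of server-side pagination is nothing more than turning a page number into a LIMIT/OFFSET pair so that only one screenful of records ever leaves the database. In a real Rails app a gem such as kaminari or pagy would handle this, but here is a minimal sketch of the math (the helper name is my own):

```ruby
# Turn a page number into the LIMIT/OFFSET to apply to a query,
# so only `per_page` rows are rendered at a time instead of all 60k.
def page_slice(total_rows, page, per_page: 50)
  page   = [page.to_i, 1].max            # clamp bad input to page 1
  offset = (page - 1) * per_page
  limit  = [per_page, [total_rows - offset, 0].max].min
  { offset: offset, limit: limit }
end

p page_slice(60_000, 3)  # => {:offset=>100, :limit=>50}
```

With ActiveRecord you would then feed the result into the query, something like `User.offset(slice[:offset]).limit(slice[:limit])`, keeping both the SQL result set and the DOM small regardless of how many users production has.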
For database slowness, changing and optimizing queries is usually the only solution. Rails helps a little here by printing the query plan for slow queries, but it is even better to use a service similar to AppOptics, with an application plugin, to help identify the slow paths in the application code.
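The idea behind both the Rails query-plan output and APM tools like AppOptics is the same: time each operation and surface the ones that cross a threshold. Here is a hedged, framework-free sketch of that pattern (the helper name and threshold are my own), which you can wrap around a suspect query or rendering step when you don’t have an APM service at hand:

```ruby
# Time a block and report it only when it is slower than the threshold,
# mimicking what slow-query logging and APM instrumentation do automatically.
def report_if_slow(label, threshold: 0.5)
  start   = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result  = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  warn format("SLOW: %s took %.2fs", label, elapsed) if elapsed > threshold
  result
end

# Anything slower than 1ms gets flagged on stderr:
report_if_slow("users.index query", threshold: 0.001) { sleep 0.01 }
```

For the database side specifically, ActiveRecord also lets you inspect a query plan by hand in the console with `explain` on a relation, which is often the quickest way to spot a missing index.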
QA and production, for financial or other reasons, use a different number of machines for different services
You’ll never need the same scalability in the test environment as in production, but sometimes (at least it happened to me) QA runs all of the application’s services on a single machine, while production runs them on multiple machines for scalability and performance.
This can cause deploy problems when you add a feature that, for some reason, references one of these services as if it were on the same machine. The QA environment will not show any problem and everything will work as expected, but when you deploy the application to production, strange things can happen.
If you are very lucky, the problem will be simple and you’ll get an “Invalid URL”, a “Connection Refused”, or something like that.
If you are unlucky like me, you can have an operation that usually takes less than a second running for 5 minutes, due to a routing problem caused by a request being made to an IPv6 address with no application listening on it, plus some ‘Execution Expired’ messages in the log file of a completely different service.
Of course, this could have been prevented with good practices: always using host names and the correct configuration in the respective environment file. But the best way to prevent it, if a service will run split across multiple machines in production, is to use at least one machine per service in QA. If you’ll use 10 machines for the same service in production to scale it, it would probably not be economically viable to use the same number in QA, but try to use at least one machine per service: for example, one for the web server, one for the WebSockets server, one for the database, one for the Sidekiq queues, and so on.
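The “always use host names and per-environment configuration” advice can be made concrete with a small initializer. This is only a sketch, and the constant, environment variable names, and default hosts are all hypothetical, but the shape is the point: every service location comes from the environment, with an explicit per-environment default, so nothing in the code ever assumes a service lives on the same machine:

```ruby
# config/initializers/services.rb (hypothetical) — resolve every external
# service through configuration instead of hard-coding "localhost".
SERVICES = {
  redis_url:     ENV.fetch("REDIS_URL",     "redis://redis.internal:6379/0"),
  websocket_url: ENV.fetch("WEBSOCKET_URL", "wss://ws.internal/cable"),
  sidekiq_redis: ENV.fetch("SIDEKIQ_REDIS", "redis://redis.internal:6379/1"),
}.freeze
```

In QA each variable can point at the single shared machine, while in production each points at its dedicated host; the application code reads `SERVICES[:redis_url]` either way, so moving a service to its own machine becomes a configuration change rather than a code change.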