
How often should we restart microservices in Production as part of maintenance?

Vice President (SDE III) at Goldman Sachs, a year ago

I was previously on another team where all the monolithic apps and microservices were restarted in the Production environment every weekend as part of scheduled maintenance.

On my current team, there are no automatic restarts. Some microservices haven't been restarted in 2+ years. Isn't this a potential problem? Won't "not restarting services" lead to increased memory consumption at some stage? Don't microservices need frequent restarts as part of maintenance?

When I asked the TL, they said that a microservice shouldn't be written in a way that causes its memory consumption to keep growing.

But that's not something we can always control, right? Hence maintenance windows.


Discussion

(5 comments)
  • 4
    Coding Challenge Writer @ CodingChallenges.fyi
    a year ago

    If the service hasn't been restarted in 2+ years it is clearly not a problem.

    I'd agree with your TL: ideally you write reliable code that does not leak memory.

    Why do you think you can't control memory leaks?

  • 3
    Tech Lead @ Robinhood, Meta, Course Hero
    a year ago

    > I was earlier part of another team, where all the monolithic apps and microservices are restarted in Production environment every weekend as part of scheduled maintenance.

    I'm going to be honest: I think that team just had bad infrastructure, and they were restarting everything as a hack to patch over the effects of sloppy code. That makes sense: if your code leaks a lot of memory, you will eventually run out of it, so a periodic wipe becomes necessary.

    > On asking the TL, they mentioned that the microservice shouldn't be written in a way that it causes increased memory consumption.

    Yep, I think this is the answer.

    It's tricky to detect memory leaks, but it's definitely possible. If you really want to go deep on it, you could add analytics here and track the problem over time with hard numbers (this is definitely a senior -> staff project IMHO).

    I know that this is different from your use-case, but the famous library for detecting memory leaks in Android is this one (memory leaks are a problem everywhere, so I'm sure every stack has an equivalent library): https://github.com/square/leakcanary
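
One way to "track the problem over time with hard numbers" is to sample a service's memory usage periodically and fit a trend line to it. The sketch below is only an illustration: the function names, the sample format, and the 1 MB/hour threshold are all assumptions, and in practice the samples would come from whatever metrics system you already run.

```python
from statistics import mean


def growth_per_hour(samples):
    """Least-squares slope of memory usage (bytes) against time (hours).

    `samples` is a list of (hours_since_start, rss_bytes) pairs.
    The sample format is an assumption made for this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = mean(t for t, _ in samples)
    m_mean = mean(m for _, m in samples)
    num = sum((t - t_mean) * (m - m_mean) for t, m in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def looks_like_leak(samples, threshold_bytes_per_hour=1_000_000):
    """Flag steady growth above a threshold (the default is arbitrary)."""
    return growth_per_hour(samples) > threshold_bytes_per_hour
```

A steady upward slope over days is a stronger signal than any single high reading, since caches and garbage collection make memory usage noisy in the short term.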

  • 3
    Senior DevOps Engineer
    a year ago

    Not restarting for 2+ years is probably a problem, but not because of memory leaks. The more realistic risk is that those microservices will be difficult to start again if they ever fail: 2 years is a lot of lost institutional knowledge.

    But blindly restarting each week isn't the solution. As Alex says, it's far more likely to be a band-aid over a less-than-ideal codebase.

    In DevOps there's a saying that's relevant here - "Cattle not pets". Your microservices should be resilient such that you could kill any component and not experience any pain. This is particularly important in ecosystems like Kubernetes where pods can and will be rescheduled.
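
To make "kill any component" painless, a service typically has to shut down cleanly when asked. Kubernetes, for example, sends SIGTERM to a pod's containers and follows up with SIGKILL once the grace period expires. Below is a minimal sketch of that pattern; `run_worker`, the job list, and the `handle` callback are hypothetical names used only for illustration.

```python
import signal
import threading

# Set when the process has been asked to stop.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: record the request and let the main loop wind down."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)


def run_worker(jobs, handle):
    """Process jobs in order until done or shutdown is requested.

    Returns the jobs that were not started, so they can be handed
    back to a queue instead of being lost.
    """
    pending = list(jobs)
    while pending and not shutdown_requested.is_set():
        handle(pending.pop(0))
    return pending
```

The in-flight job always finishes before the loop checks the flag, so a restart never cuts a unit of work in half, provided each job completes within the grace period.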

    Netflix's solution to this was to implement "Chaos Monkey", a service that randomly terminated production instances during the day to ensure their systems could handle that kind of event.
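
The core idea can be sketched in a few lines. This is not Netflix's implementation: the `terminate` callback, the instance list, and the weekday 9-to-5 window are assumptions chosen for illustration, the window reflecting the idea that failures should be injected while engineers are around to respond.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on Monday to Friday between start_hour and end_hour (assumed window)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def unleash(instances, terminate, now, rng=random):
    """Terminate one randomly chosen instance, but only in business hours.

    `terminate` is a hypothetical callback that would call your cloud or
    orchestrator API. Returns the chosen instance, or None if nothing was done.
    """
    if not instances or not in_business_hours(now):
        return None
    victim = rng.choice(list(instances))
    terminate(victim)
    return victim
```

Passing `now` and `rng` in as parameters keeps the function deterministic under test, for example `unleash(ids, kill_fn, datetime.now())` in a scheduled job versus a fixed timestamp and a seeded `random.Random` in a unit test.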

    I think a more holistic approach is to ask why these microservices are the way they are. Why does one require weekly restarts? Why has one run continuously for 2+ years without a single restart? Could this be improved or changed?

  • 2
    Tech Lead/Manager at Meta, Pinterest, Kosei
    a year ago

    > Won't "not restarting services" lead to increased memory consumption at some stage?

    Why do you think that? Don't invent problems: unless you have historical evidence of issues that have resulted from not restarting (either at GS or another company that is similar), I wouldn't worry too much.

  • 0
    Senior Software Engineer [OP]
    Goldman Sachs
    a year ago

    Thanks all! That helps :)