Lessons in Reliability: Rishi Goomar’s Journey from Dev to SRE

Insights from a Staff SRE at Lattice on Building Resilient Systems

Rob Genova
September 11, 2024

 

Rishi Goomar is a Chicago-based Staff Site Reliability Engineer at Lattice and an advisor at QPoint. With a background in front-end and full-stack development, Rishi has evolved into a leader in continuous integration, site reliability, and cloud infrastructure. He is passionate about exploring new technologies and fostering growth in both technical skills and leadership.

Part of our Ops Heroes series, this interview dives into Rishi’s journey from a curious tinkerer to a seasoned SRE, exploring the insights and experiences that have shaped his career.

From Tinkering to Tech

 

"That's one of the hills that I'll die on. Jenkins is not good!"

 

Qpoint: Rishi, you’ve had a fascinating journey from front-end development to becoming a Staff SRE. What sparked your interest in the tech field, and how did that passion evolve into a focus on operations and SRE work?

Rishi Goomar: What sparked my interest initially was tinkering around in high school. I built my own computer. I dug into one of the first Raspberry Pis. I also got into trying to figure out ways to make money without leaving home, which led me to learn some basic skills like HTML, CSS, JavaScript, and so on. This was probably 16 years ago!

My first jobs were front-end focused with some backend, but I've always had an interest in Linux. I probably bricked two or three laptops, just trying to mess around with complicated distributions like Gentoo, where you have to compile everything and configure a bunch of flags.

So when it came time to deploy, some of that Linux tinkering came in handy. And I was like... oh, I could write a Bash script to do this! There were no complicated containers. It was basically FTP this up and use a script to initialize it. There were definitely issues though. You get situations where you have to troubleshoot, and rollbacks were not easy.

One of the issues that really got me into the DevOps/SRE side of things was that we had no automated testing at my first startup. And we were growing fast - there were well over a hundred engineers. Myself and another engineer started a team just to get automated testing in place and get a CI pipeline going. That's how I started my first foray into SRE and infrastructure automation.

Qpoint: Where was your tech hosted? 

Rishi: This was on AWS, but we used a slew of tools like Chef and Jenkins, which I had to manage. And, I'm gonna be honest, Jenkins is not the easiest thing to manage because it doesn't use a good stateful system. That's one of the hills that I'll die on. Jenkins is not good!

 

Fighting Fires and Learning on the Job

 

"We had so much load that it caused our database to essentially topple over."

 

Qpoint: Which startup was it that really got you into SRE?

Rishi: Uptake was where I got my start in the ops space. And then after being on that team for a while, I joined Rocket Travel and that's where I became an infrastructure SRE full time. That was where I got the most exposure because it was me and one other person. And there were many fires to put out. Like... this was a startup that was growing quite a bit, had recently been acquired, and had a path toward more positive EBITDA and things like that, so there was a lot being done.

Qpoint: What was the craziest outage you had to deal with?

Rishi: There was a crazy one at Lattice where we got a huge amount of traffic due to one of our larger customers initiating a massive review cycle at the time. This caused a lot of strain on our systems. Some of the queries were not optimized and the code paths were not there. We had so much load that it caused our database to essentially topple over. We put the site in maintenance mode, but when we brought the system back up we ran into a thundering herd problem that brought it right back down again. Our workaround was essentially to throw money at the problem and beef everything up temporarily while we figured out a fix to the root cause. Easy to do with cloud!
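Editor's note: a thundering herd happens when many clients retry or reconnect at the same moment, so a service that has just recovered is immediately overloaded again. One common mitigation is exponential backoff with random jitter, which spreads retries out over time. The sketch below is a generic Python illustration, not the fix Lattice used; the names `call_with_retries`, `max_attempts`, `base`, and `cap` are hypothetical and chosen only for this example.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call operation(); on failure, wait a randomized delay and retry.

    Uses "full jitter": each wait is drawn uniformly between 0 and an
    exponentially growing ceiling, so clients that failed together do
    not all retry at the same instant. Assumes max_attempts >= 1.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Ceiling doubles each attempt, clamped at `cap` seconds.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0.0, ceiling))
```

A caller might wrap a database query as `call_with_retries(run_query)`, where `run_query` is any zero-argument function. Without the random component, every client would wait the same 0.5, 1, 2, ... seconds and arrive together again, which is the pattern described above.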

Qpoint: You are still at Lattice today...

Rishi: Yeah, so Lattice is basically an all-in-one HR tool at this point. It started out with talent management like performance reviews, 1:1s, HRIS, Payroll (soon) and goal setting. We are currently moving towards an all-in-one people platform.

Qpoint: What is your most valuable operations tool?  

Rishi: One of my favorite tools has been Datadog. You can integrate everything, use database management to see the queries, and do stack traces. I have found it to be a very powerful tool. For most companies, I feel like you should focus on making sure the reliability is there. You need the observability.

Qpoint: Have you had to deal with any security breaches?  

Rishi: Not any particularly heinous ones. More like vulnerabilities that we discovered through logging or we just noticed something very strange. I’ve found that a lot of those security incidents tend to highlight principles that you can apply elsewhere though.

Qpoint: Which principles have you found to be particularly important?

Rishi: Definitely least privilege. And regular pen testing. Using a service like HackerOne on your site is incredibly helpful because not everyone knows how to test through those use cases.

Qpoint: Any other ops hills that you would die on other than Jenkins? 

Rishi: You don't need Kubernetes for everything. I’ve seen plenty of startups try to start off by building clusters and getting complicated with their systems. There is so much complexity with a Kubernetes cluster, especially as you scale up and need to manage multiple clusters. There are many open source alternatives like Dokku and Kamal, or even paid services like Render or ECS Fargate, that abstract a lot of the complexity.

 

SRE Principles That Matter Most

 

"When you're at a company that's growing rapidly, just embrace the failure and learn from it."

 

Qpoint: When something goes wrong, how does that tend to play out on your team?

Rishi: The one thing I think we've done pretty well with is attribution of issues. We've figured out how to route the different types of errors to the correct teams or services. For the most part, the right people get looped into an incident. When things do go wrong, the important thing is to learn from it and make sure that we're having a conversation about what happened and what we can do better, and not necessarily point the finger. What I've seen happen a lot is that a person will take the accountability for it themselves and be like, “Oh, I'm sorry I caused that.”

And honestly, I see that as a rite of passage. When you're at a company that's growing rapidly, just embrace the failure and learn from it.

Qpoint: Any thoughts on the state of SRE today or how you think it should evolve?

Rishi: I'm seeing an increased focus on data stores, which has been really good to see. It makes sense with the AI boom that's happening right now. I've dug pretty deep into Postgres over the last few years. I think there's a lot of power in these database tools that is currently underutilized. We were able to get rid of two whole read replicas just by optimizing queries without sacrificing performance.

Qpoint: Will AI be a game changer for SRE? 

Rishi: I think it will change things for the better, but we're not there yet. There are a lot of things that end up being manual work or toil on the operations side, and I think there's an opportunity to use AI to help reduce that. I think it's going to be a while before AI can recommend the right infrastructure or architecture based on the use cases and how to make it highly available and integrate with other systems. But I see it being very useful as an assistant. I leverage Copilot or Anthropic like a compass in a way. It writes pretty good Bash. For complex queries though, it can give suggestions but not necessarily the right answers. There is a lot more on the line (for ops) than just having it write an application for you. It is like the infrastructure upon which all of the applications for your company run, so if something goes wrong it's going to impact the whole business.

Qpoint: Looks like you are living in Chicago. Did you grow up there?

Rishi: Yeah, I grew up in the suburbs here and have been living in the city for 13 years now.

Qpoint: I noticed on your website that you're a foodie. Do you watch The Bear?

Rishi: Yep. I’m a bit of a foodie. I think regardless of being a foodie, everyone in Chicago is into The Bear! Partly because the place on the show is a real restaurant, Mr. Beef.

Qpoint: Have you been there?

Rishi: Yeah, it's pretty good! I'm more of a basic or a lazy cook at home, but when I go out I love to try whatever the restaurant specializes in, especially the weird stuff!

Qpoint: Thanks so much for your time today, Rishi.

Rishi: Thanks, Rob. Take care.

 

Connect with Rishi

Check out Rishi on LinkedIn and GitHub to dive deeper into his world of SRE and tech innovation.
