These Years I’ve Learned: the technical bits, and more

microwavestine
14 min read · May 29, 2023
Mammoth Mountain, CA. Rolled down those hills instead of skiing; lost phone, contemplated life and stuff in this post. Photo from official Mammoth website.

TLDR;

  • Write code like it’s going to break
  • Deploy services like they’re going to go down
  • Write comments, and document like you will forget tomorrow (you eventually will) + log and monitor like a historian

Thank you to kristopher for mentoring and befriending me all these years, and for words to remember like the TLDR above.

This is also a tribute to Sanghyun Gwak, who worked on the South Korean EHR system; Sanghyun passed away in 2021 from the coronavirus in Cambodia. Thank you for supporting my interest in all things technical, and reigniting my spirit in the emotionality of Andrei Tarkovsky’s films. May your soul & kernel rest in peace.

I left the job I loved in April 2023.

Over the years I have learned that refining a tool or skill requires an adequate amount of oil and abrasive; otherwise, the axe or knife turns more brittle, rusty, and ultimately dangerous.

That oil and abrasive in the context of our professional lives, in my opinion, is leisure and emotional well-being. Without it, you could still have a knife. One could still work… but one would have to work twice as hard and maybe cut a finger or stab oneself doing it (sometimes literally). I have seen / experienced too many broken hearts and sociopaths created from relentless hustling.

I loved the work itself, but the workplace was falling apart and the (metaphorical) oil was running out; ironically, to continue working, I had to leave work.

Postquam nave flumen transiit, navis relinquenda est in flumine.

When you cross the river, you have to leave the boat behind.

- Lectio Linguae Latinae by Professor Dong-il Han

I am writing this post to reflect on the past few years building up this career and to hopefully transfer this knowledge onto my next journey or to pick up where I left off if I come back. And for people starting off on their careers in SaaS (backend), it may give some ideas on how to survive through the first few years.

TDD

Regardless of the test methodology (real data vs. mock data), the core philosophy behind TDD is that the code will break, and testing provides a minimal guard against the failures we can imagine. In reality, the guard or global filter you thought would catch all exceptions may not handle them properly when you’re dealing with the intricacies of GraphQL and third-party APIs. Production is hard. A data source that can’t possibly return null, such as a primary key from a database, might return null for unknown reasons. The customer might ask where the bathroom is.

There are two ways to approach the fact that code is going to break. One is prevention, which is what TDD and test files are about. The other is recovery; I think the never-ending discussions around clean code principles are not really about code performance but about the best way to organize code so that we can detect bugs easily and modify code as needed.

To put it another way, prevention is “I am going to make this code bullet-proof” and recovery is “…but in case there’s a stronger bullet, I will make sure there’s a way to fix it quickly”. I theorize that it takes both approaches to make a piece of code more reliable and predictable, pulling together thoroughness and humility.

Some of the most common mistakes and test failures I made and saw in each approach are…

Prevention

  • Destructuring without default values, assuming the value is going to be there.
  • Forgetting nullish coalescing / handling of null values
  • Assuming nullish coalescing will prevent printing “undefined” (if the data source sends it as a string, you need to explicitly check for the string value “undefined” and other such values)
  • Throwing errors with JSON.stringify() for better logging
  • “All unit / e2e tests passed so it’s okay”
  • “CI/CD passed so the server/pod must be running” (hello, we don’t have a server health check alert in place yet?)
  • “I have deployed the right image” (did you really?)
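The first three items in the prevention list can be sketched in TypeScript. This is a minimal illustration, not code from the original service: the `UserPayload` shape, the helper names, and the fallback strings are all made up for the example.

```typescript
// Hypothetical payload shape; the field names are for illustration only.
interface UserPayload {
  name?: string | null;
  profile?: { city?: string | null } | null;
}

// Strings that some data sources send in place of a real null or undefined.
const EMPTY_MARKERS = new Set<string>(["", "undefined", "null"]);

// Returns the fallback for null, undefined, and their stringified forms.
function orFallback(value: string | null | undefined, fallback: string): string {
  const candidate = value ?? fallback; // covers real null / undefined only
  return EMPTY_MARKERS.has(candidate.trim()) ? fallback : candidate;
}

function describeUser(payload: UserPayload | null | undefined): string {
  // Fall back to empty objects before destructuring, so a missing payload
  // or a null profile does not throw a TypeError.
  const { name, profile }: UserPayload = payload ?? {};
  const { city }: { city?: string | null } = profile ?? {};
  return `${orFallback(name, "Unknown")} (${orFallback(city, "no city")})`;
}
```

Note that a destructuring default such as `{ name = "Unknown" }` only applies to `undefined`, not to `null`, which is why the sketch routes every value through one helper instead.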

Recovery

Where did I get this idea?

I worked through “Railway Engineering: An Integral Approach”, an online course by DelftX on edX. Why railway engineering? Because I love trains. I have since deleted my Facebook account, but I loved the NUMTOT community there and shared South Korean train events and tours with it.

I learned about preventive and reactive maintenance measures in railway systems, which I connected to the roles of TDD and code documentation.

Learn Go with Tests provides solid groundwork for TDD, which I wish I had learned earlier.

MSA Project: Connection Reset, Connection Refused

Monday

You’ve been tasked with working with a particular service A in a semi-MSA architecture (and there are plans to further convert services into microservices). Your service calls another service B that provides much of the data for your service.

Tuesday

A hotfix ticket arrives, and it seems simple enough, so you let the assigner know that it can be fixed in an hour*.

*Don’t do this.

Then you proudly say to the assigner, “it’s been deployed, you can check it in production.”

The ticket assigner thanks you for your swift problem-solving skills and insinuates that previous engineers were “too slow to respond or to communicate”. You feel like you’re built for this job.

Then about an hour later, you get a DM from them: “The page of the hot-fix feature is blank, could you take a look?” You go to the page and indeed, it does not load. You open the network tab. 503 error calling service B. Heck, this could be anything. The one-hour fix is going to take an unknown number of hours.

Wednesday

You look at the logs for your service and notice an absurd amount of connection reset or connection refused errors. You DM the person in charge of service B, “hey, there’s this issue calling APIs for service B. Seems like service A keeps timing out. Can you check service B’s logs to see if requests from service A actually arrived?”

They respond, “nope, there’s no records of service A calling service B”.

Huh? So service A never actually retried…?

So you put in retry options for http services.

Thursday

They ring you up again. “It’s still the same. I even checked timeout for service B pod spec, it’s 60 seconds. There’s no way our data takes 60 seconds to load.”

Connection Resets while calling internal services

The main cause for the elusive operational issue above turned out to be a fairly well-known Kubernetes network bug, but it can be something easy to miss when the service architecture was not originally built to be MSA.

First possible cause: Kubernetes

This blog explains that a race condition bug in the Linux kernel affects SNAT and DNAT handled through iptables, and that as a consequence Kubernetes pods randomly drop packets sent between each other. Cases such as

a) service A sending request to service B but there’s no record of such request on B’s log

b) there is log on service B indicating it has received request from service A, but service A has no records of receiving the response from service B (and has timeout instead)

are consistent with this bug and may explain the connection reset errors and timeouts. One thing to note from the blog is that when a pod sends a request to a ClusterIP, kube-proxy by default translates the ClusterIP into the Pod IP of the service being requested (and since it uses iptables, this is where the bug lies). One of the most commonly used cluster services is DNS, and this bug can also happen during name resolution.

Some more information regarding the iptables bug:

When a packet for such pod-to-external traffic is detected as INVALID (for whatever reason), it’s wrongly delivered and causes the connection drop. The solution is to add one iptables rule to drop such INVALID packets.

Second possible cause: http & https settings

Service A did not have keepAlive set to true, and it may be worth testing the effects of turning this setting on. However, NestJS 6 has a socket hang-up issue; the problem appears to have been resolved since NestJS 7.
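For reference, turning keep-alive on at the Node.js level looks roughly like this. It is a sketch using the built-in `http.Agent`; the numeric values are placeholders rather than tuned settings, and how the agent gets wired into service A's HTTP client is left out.

```typescript
import * as http from "node:http";

// A shared agent that reuses TCP connections between requests instead of
// opening a new one each time. The numbers are placeholders, not tuned values.
const keepAliveAgent = new http.Agent({
  keepAlive: true,
  keepAliveMsecs: 1000, // initial delay for TCP keep-alive probes on idle sockets
  maxSockets: 50, // upper bound on concurrent sockets per host
});
```

HTTP clients built on axios accept such an agent through the `httpAgent` request option, and `https.Agent` takes the same `keepAlive` option for TLS calls. Whether this helps with the resets described above is something to test rather than assume.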

In the end, the temporary solution was to apply extensive caching on service A to reduce the amount of traffic to service B. It made me wonder if there’s a better way to solve this issue for MSA, and after fumbling around discussions on micro and macro architecture, I bumped into the world of Go/gRPC and Elixir; they are my current obsessions, along with trains.
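The caching idea can be sketched as a small in-memory cache with a time-to-live. This is an illustration under assumptions, not the implementation used on service A: the class, the `getOrFetch` helper, and the `fetcher` callback (standing in for a request to service B) are all hypothetical.

```typescript
// Minimal in-memory cache with a time-to-live. The clock is injectable so
// expiry can be exercised without waiting.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Returns the cached value while it is fresh; otherwise calls `fetcher`
// (standing in for a request to service B) and stores the result.
async function getOrFetch<V>(
  cache: TtlCache<V>,
  key: string,
  fetcher: () => Promise<V>,
): Promise<V> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const fresh = await fetcher();
  cache.set(key, fresh);
  return fresh;
}
```

A cache like this lives per pod and cannot store `undefined` as a value; a shared store would be the next step if several replicas need to agree on what is cached.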

Loading JSON from Vault / Kubernetes Secrets

I’ve learned the hard way that for most cloud-based services like Vault and GKE, there is an unspoken rule for loading JSON. For example, for private keys in Vault, you have to load the key wrapped in a JSON object under a key of private_key or privateKey, like this:

{ "private_key": "-----BEGIN PRIVATE KEY----- ... -----END PRIVATE KEY-----\n" }
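Reading such a secret back can be sketched as follows. The `readPrivateKey` helper is hypothetical, and the newline handling rests on an assumption: that the stored value may contain the two characters `\` and `n` where the PEM format expects real line breaks, which is a common way for such keys to arrive.

```typescript
// Hypothetical helper: parses a secret stored as JSON and extracts the key.
function readPrivateKey(secretJson: string): string {
  const parsed: unknown = JSON.parse(secretJson);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("secret is not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  // Accept either spelling of the wrapper key.
  const key = record["private_key"] ?? record["privateKey"];
  if (typeof key !== "string" || key.length === 0) {
    throw new Error("secret has no private_key / privateKey field");
  }
  // Assumption: newlines may have been stored as a literal backslash + "n";
  // turn them back into real line breaks so the PEM parses.
  return key.replace(/\\n/g, "\n");
}
```

Failing loudly on a missing field at boot is deliberate: a key that silently resolves to an empty string tends to surface much later as an unrelated-looking authentication error.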

If you’re looking to load Firebase private key from Kubernetes Secret or Vault to initialize Firebase Admin SDK … there’s a better way now!

Workload Identity Federation

Issues with service account keys bring us to Workload Identity Federation. Still passing around private key files with coworkers? Contact DevOps immediately to set up WIF, or set one up yourself.

Health check on server boot

There’s no better tutorial example for a Hello World server than implementing a health check, because it is as essential as a naming convention; doing a health check on bootstrap is a convention for good reasons. If your service relies on third-party API endpoints or multiple databases and your server can’t ping them on boot, you notice the problem early on instead of having to test every single one of them.

This will save you hours of unnecessary operational debugging work.
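A boot-time check along these lines can be sketched as below. The `Probe` type and function names are made up for the example; each `ping` would wrap whatever call proves a dependency is reachable, such as a trivial database query or a request to a third-party status endpoint.

```typescript
// A dependency to verify on boot. `ping` should reject if it is unreachable.
type Probe = { name: string; ping: () => Promise<void> };

interface ProbeResult {
  name: string;
  ok: boolean;
  error?: string;
}

// Rejects if `work` does not settle within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function runProbe(probe: Probe, timeoutMs: number): Promise<ProbeResult> {
  try {
    await withTimeout(probe.ping(), timeoutMs);
    return { name: probe.name, ok: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { name: probe.name, ok: false, error: message };
  }
}

// Runs every probe in parallel and reports all results, so one failing
// dependency does not hide the state of the others.
async function checkDependencies(probes: Probe[], timeoutMs = 3000): Promise<ProbeResult[]> {
  return Promise.all(probes.map((probe) => runProbe(probe, timeoutMs)));
}
```

On bootstrap, the service could log the results and refuse to start (or start in a degraded mode) when any probe reports `ok: false`; which of the two is right depends on how critical each dependency is.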

Learn to use Kubernetes IDE & study CKA

A lot of hours after deployment are spent on monitoring and debugging live development or production servers. Get comfortable with at least one Kubernetes IDE to switch between contexts, restart pods, and access a bash shell in containers.

Even if you are doing backend, you will be working closely with DevOps on secrets and environment variable settings, sometimes even pod specs. I’d recommend studying for the CKA, and for the CKAD and CKS as well if you are aiming to become DevOps.

Learn to use log filter (GCP specific)

Google Cloud Logging provides a lot of ways to filter logs. In the header of the “summary” column of the logs, there is a small edit button for filtering the logs. I would usually add jsonPayload.context and jsonPayload.function to filter out unnecessary logs, then further narrow down the search results with resource.labels.namespace_name and other options. You will most likely be using Logging when you are looking for a particular error message, as the UI is not quite suitable for real-time monitoring.
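The same fields can also be combined into a query for the Logs Explorer. The sketch below builds such a filter string in TypeScript; the helper and the example values (`AuthService`, `login`, `production`) are hypothetical, and only the three field names come from the paragraph above.

```typescript
// Builds an exact-match filter in the Logging query language:
// field="value" comparisons joined with AND.
function buildLogFilter(fields: Record<string, string>): string {
  return Object.keys(fields)
    .map((field) => `${field}=${JSON.stringify(fields[field])}`) // quote and escape the value
    .join(" AND ");
}

// Example values are placeholders for whatever your services actually log.
const exampleFilter = buildLogFilter({
  "jsonPayload.context": "AuthService",
  "jsonPayload.function": "login",
  "resource.labels.namespace_name": "production",
});
```

Pasting the resulting string into the query box narrows the view to one function in one namespace, which is usually enough to find a specific error message.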

Another Google feature to look out for is Error Reporting: you can keep an eye on it for any peculiar spikes in errors and resolve error reports there.

UX Performance Tuning > Code Performance

One of the services I was responsible for had extreme performance issues that were unacceptable for modern-day users. Yet there were plans to launch an ambitious new service (within a service, my god) that aimed to attract new users. I suggested a focused session on improving UX, particularly performance, before the new feature launch; otherwise, all the plans and effort of the team would not shine as expected.

Three months before the new feature launch, four frontend and backend engineers huddled up in the lounge for a short, productive meeting on what could be leveraged for performance while running short on time. We opened up the network tool in Chrome, took a look at it together, and the result of the brainstorm was to…

  • Rearrange the UI in a way that does not call query- or time-expensive APIs
  • Apply pagination
  • Investigate mysterious calls to third party websites, eliminate unnecessary calls
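Of the items above, pagination is the easiest to illustrate. The following is a minimal offset-based sketch over an in-memory array, not the approach the team actually shipped; the `Page` shape and the cap of 100 items per page are assumptions made for the example.

```typescript
interface Page<T> {
  items: T[];
  nextOffset: number | null; // null when there is nothing left to fetch
  total: number;
}

// Returns one page of `rows`, clamping offset and limit to sane bounds so a
// client cannot request an unbounded result set.
function paginate<T>(rows: readonly T[], offset: number, limit: number): Page<T> {
  const safeLimit = Math.min(Math.max(1, Math.floor(limit)), 100);
  const safeOffset = Math.max(0, Math.floor(offset));
  const items = rows.slice(safeOffset, safeOffset + safeLimit);
  const end = safeOffset + items.length;
  return {
    items,
    nextOffset: end < rows.length ? end : null,
    total: rows.length,
  };
}
```

In a real API the slice would be pushed down into the database query rather than applied to rows already loaded into memory; the point here is only the shape of the response, which lets the UI fetch what is on screen instead of everything at once.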

They were simple enough to be worked on and tested within a week, and definitely counted as progress by the end of the sprint. The DevOps and infrastructure team also helped out greatly in reducing network load by trying out various options in gateway settings.

The backend team used Google Cloud Trace to verify a 2-3x drop in average response times for the product, and got positive feedback from customers, coworkers, and the CTO that “the app felt faster than before”.

UX is about making the users feel comfortable using the service. And sometimes the solution for effective UX is so simple that it doesn’t require months of planning; it can be done in one sitting given the autonomy of a small task force. If you’re looking to hack a product/service growth leverage, focus on intuitive ways to improve UX instead of adding new features for short-term metrics.

How to do knowledge transfer

One of the first things I do when I join a team is to write a knowledge transfer document that is usually written when someone quits the team and passes on information to the next person. The inspiration to do this came from an absurd turnover rate of organizations (but a rather normal one in the battlefields of startups).

Imagine being deployed to the frontlines, where you are responsible for some soldiers second in rank and for guiding student soldiers still in school uniforms. You know a lot of them won’t be here tomorrow. You might not be here tomorrow. How do you ensure that the battle goes on?

A person joining the team or a person being left behind needs to know four things:

  • Work you’ve done (Work List)
  • Your achievements or things you have solved successfully (Achievements List)
  • Your mistakes (Mistakes List)
  • Who to contact for what issues and how (People List)

The people list is best transferred one on one rather than through documentation, for an obvious reason: people change. In startups and SMEs, people change more often and faster than code does.

Some things you could do with people list:

  • Share tips on how to communicate with “that” coworker: whether to DM them or find them in the office; whether they prefer talking in channels or office halls.
  • Slack channels for what and why / who made them for what purpose
  • Who to ask for X
  • Different scenarios you encountered (good and bad) with other coworkers, leaders and how to respond to them
  • How to interact with people from another department or team (if there are protocols to follow)

What not to do sharing people list:

  • Their personal histories and issues, things you exchanged on personal terms outside work hours, e.g. their illness history or anything related to family

(To be continuously updated)

The reason you join a team will be the reason you leave it

Unless you believe that having X years of experience speaks for itself or is completely reflective of your skills, you will most likely look for, and move to, teams or companies that satisfy your career needs.

At some point you might wonder, am I doomed to switch jobs, maybe even whole career paths, until I retire? When do I stop? Am I being greedy, selfish, irresponsible?

What matters for the story of one’s career is to be honest with one’s past, present and future selves. Your past self may think it’s inappropriate to hop jobs so frequently; your present self may view it as a sign of competence. These relative measures are not useful in creating an absolute narrative of one’s career and constructing the meaning of one’s life. Why would we strive so hard to build a career or strive to do X after all?

There is only one pattern that happens regardless of one’s perspective at a given time: the reasons why one decides to do X become the reasons why one undoes those decisions later. There’s no right answer to how long a person should stay, and no right answer to why a person would leave; however, there can be an answer one can make for one’s narrative: I do not know if I will leave tomorrow, a year or a decade later, but I know I will leave when the boundaries of the reasons I created no longer surround me.

Final words: You’re not an imposter, you’re a beginner

On a Discord channel I saw someone post a concern that they wouldn’t be able to do “one person’s work” and would be a useless beginner.

It reminded me of my time as a struggling graduate, fresh into the workforce and being told that I was “just a beginner”, or not meeting the standards at work. The job did not meet my standards either, and there was no more work to do at the exit stage, so I proposed resignation.

I theorize that there is great fear of being an imposter or not doing “one person’s work” because we are afraid to admit that for most things, we are complete amateurs.

We only appear like experts because we have dug rabbit holes as amateurs and crawled out of them, but that does not guarantee experts don’t dig holes. Someone could point to an expert digging a hole and say, “what a beginner”, when that person has been digging holes for years and is just trying a different way of digging a hole, or could just genuinely be digging holes for its own sake.

What people really mean when they accuse you of being a beginner is that you do not provide the values or skills required in that organization at that time. This is not personal, though it may feel personal because it is often the same way relationships break. There are many ways to cope with it.

My strategy, at least until now, has been to move to organizations that will value my skillsets, and continue to hone those skillsets in a way others can’t replicate. I believe that this is not just beneficial for personal growth, but also for companies that are looking to grow fast (I specialize in stabilizing fast-growth organizations / exits) and need my expertise.

Others will have other strategies depending on the nature of their skillsets, personalities, stages of life, etc.; there’s no right answer in strategy, as in life, no matter what’s posted on LinkedIn or the walls of CEOs.

For further reading…
