<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[masudio - tech]]></title><description><![CDATA[masudio - tech]]></description><link>https://www.masudio.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1759163197985/91f06167-6464-4d54-91ab-3ed728859f69.jpeg</url><title>masudio - tech</title><link>https://www.masudio.com</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 07 Apr 2026 20:50:50 GMT</lastBuildDate><atom:link href="https://www.masudio.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Learn AI Independently: How to Use ChatGPT and Other Tools to Understand a Technical Book]]></title><description><![CDATA[I've spent the past 13 years in software engineering—and, call me a masochist, but my favorite part is still running face-first into unfamiliar tech and coming out the other side. 
If you’re a jack-of-all-trades like me, you know the drill: you can da...]]></description><link>https://www.masudio.com/learn-ai-independently-how-to-use-chatgpt-and-other-tools-to-understand-a-technical-book</link><guid isPermaLink="true">https://www.masudio.com/learn-ai-independently-how-to-use-chatgpt-and-other-tools-to-understand-a-technical-book</guid><category><![CDATA[AI]]></category><category><![CDATA[technology]]></category><category><![CDATA[learning]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Wed, 24 Sep 2025 07:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759161311652/3702b09a-993c-4d9c-8424-8e960e13fac7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've spent the past 13 years in software engineering—and, call me a masochist, but my favorite part is <em>still</em> running face-first into unfamiliar tech and coming out the other side. If you’re a jack-of-all-trades like me, you know the drill: you can dance through AWS, sprinkle some YAML, chat about machine learning infra, and even nerd out about k8s clusters, <em>but</em>—rarely do you wake up feeling like “The One” for any particular topic. You know, the person who the buck stops with.</p>
<p>To get there, you’d have to live and breathe the material. <em>Sweat the APIs. Debug the dark corners. Ship the damn thing yourself.</em> But here’s the secret: you <em>can't</em> fake expertise. You need a foundation built on honest effort. For me, books and courses have always been the gateway.</p>
<p><strong>But here’s the rub:</strong> it’s too easy to fool yourself into thinking you’ve “mastered” a book just by reading it. Absorbing core technical material is a multi-step grind — you’ll need to let your brain chew on the information, test yourself, apply it, and loop back for more. Passive reading? It’s like a cheat day for your neurons.</p>
<p>Recently, I picked up <a target="_blank" href="https://www.oreilly.com/library/view/architecting-data-and/9781098151607/"><em>Architecting Data and Machine Learning Platforms</em></a>. I wanted to <em>inhale</em> the knowledge—not just skim it. So, I dusted off three tried-and-true tactics:</p>
<ul>
<li><p>Read a section, <strong>highlight</strong> mercilessly, jot down the essentials for rapid review.</p>
</li>
<li><p><strong>Summarize</strong> those highlights at chapter’s end—like your own private CliffsNotes.</p>
</li>
<li><p>Find a <strong>study buddy</strong> who’s willing to nerd out, challenge my takes, and call out the chapters where the author gets delightfully vague.</p>
</li>
</ul>
<p>But this time, I upped the ante and recruited a new tutor—the Large Language Model (LLM).</p>
<p><strong>Here’s where it gets spicy:</strong><br />Imagine you’re enrolled in a graduate seminar. You don’t just read—the professor drills you on concepts, throws pop quizzes, and points out where you sound like a hallucinating AI. That's what I wanted. Why not let an actual LLM do the heavy lifting on the quizzes, feedback, and meta-level nitpicking?</p>
<p>So, I gave it a shot. After each chapter, I fed my best notes to two different AI windows (let’s call it “double-barreled learning”): one ChatGPT session for speed, one for patience (shoutout to o3 pro deep research mode). Yes, I ponied up $200 for early access—because if you’re not burning money on AI subscriptions, do you <em>really</em> love learning?</p>
<p>The results? <strong>Ridiculously useful quizzes.</strong></p>
<ul>
<li><p>Varied formats.</p>
</li>
<li><p>Deep recall questions.</p>
</li>
<li><p>Surprising nuance.</p>
</li>
<li><p>Instant feedback <em>without</em> that “please see me after class” embarrassment.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758822666422/70fc3d67-56b2-4087-a574-72612cd6c6ac.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758822709580/3bc77c32-1397-4050-968a-274c74d5fb0a.png" alt class="image--center mx-auto" /></p>
<p>Best of all, it felt low-stakes. Getting something wrong was honestly great—since the LLM would break down <em>exactly</em> where I missed the mark.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758822768798/766f4b94-1717-4da4-a15f-17a94f36e1f3.png" alt class="image--center mx-auto" /></p>
<p>(I know, I know: “LLMs hallucinate! You’re just getting tricked, Masud.” Go ahead, be skeptical—I dropped the quizzes and notes for you to judge below.)</p>
<p>Why did this work? The secret: <em>grounding</em>. By feeding only my curated chapter notes, the LLM couldn't stray into mainframe poetry or invent quantum acronyms. Want less BS? Feed it better data.</p>
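<p>A grounded prompt is easy to build yourself. Here's a minimal sketch in Python; the function name, prompt wording, and sample notes are my own illustrations, not any particular tool's API:</p>

```python
# Minimal sketch of a "grounded" quiz prompt: the model sees only your
# curated notes, so it has little room to invent material.
# All names and wording here are illustrative.

def build_quiz_prompt(chapter_title, notes, num_questions=5):
    """Assemble a quiz prompt grounded strictly in the supplied notes."""
    notes_block = "\n".join(f"- {note}" for note in notes)
    return (
        f"You are a strict graduate-seminar tutor. Using ONLY the notes "
        f"below from the chapter '{chapter_title}', write {num_questions} "
        f"quiz questions (mix multiple-choice and short-answer). If a "
        f"topic is not in the notes, do not ask about it.\n\n"
        f"Notes:\n{notes_block}"
    )

prompt = build_quiz_prompt(
    "Designing Your Hybrid and Multicloud Strategy",  # hypothetical chapter
    ["Data gravity pulls compute toward where data lives",
     "Egress costs dominate naive multicloud designs"],
    num_questions=3,
)
print(prompt)
```

<p>The "do not ask about topics outside the notes" instruction is the whole trick: it scopes the model to your curation instead of its training data.</p>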
<p>I ran this experiment across three platforms:</p>
<ul>
<li><p>GPT-4.5 and o3 pro deep research (early days).</p>
</li>
<li><p>Later, Perplexity Pro (because I’m a completionist and apparently love paying those premium AI rates).</p>
</li>
<li><p>Sometimes, I even overlapped subscriptions to see which AI UI made me sweat harder on the quizzes.</p>
</li>
</ul>
<p><strong>Spoiler:</strong> Ground the prompts in real notes and any leading LLM performs great. Minor differences, maybe deeper question angles, but the fundamentals are solid.<br />Pro tip: Do this with <em>any</em> recent model—ChatGPT, GPT-5, Claude Opus/Sonnet—the <em>tool</em> matters less than your <em>process.</em></p>
<p><strong>Give it a try!</strong><br />If you’re prepping for a job interview, building your knowledge for a killer work project, or just want to pass as The One in your Slack channel, this workflow is gold.<br />Feed your notes. Build your own quizzes. Let AI drill you until the material is second nature.</p>
<p><em>Would love to hear how it goes for you!</em><br />If you try this, drop a comment. I want to see what methods you invent—and which questions stump you most.</p>
<p><strong>Ready to see the raw experiment?</strong><br /><a target="_blank" href="https://github.com/masudio/ADMLP-Notes-and-Quizzes/tree/main">Notes and Quizzes: GitHub Repository</a></p>
<p>Don’t let anyone tell you AI can’t teach you something new—they’re just hallucinating.</p>
]]></content:encoded></item><item><title><![CDATA[Uber's Michelangelo vs. Netflix's Metaflow]]></title><description><![CDATA[Originally published on Blogger. This post compares two major ML platform approaches from industry leaders.
Michelangelo
Pain point
Without michelangelo, each team at uber that uses ML (that's all of them - every interaction with the ride or eats app...]]></description><link>https://www.masudio.com/ubers-michelangelo-vs-netflixs-metaflow</link><guid isPermaLink="true">https://www.masudio.com/ubers-michelangelo-vs-netflixs-metaflow</guid><category><![CDATA[ML]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[uber]]></category><category><![CDATA[netflix]]></category><category><![CDATA[technology]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Fri, 16 May 2025 07:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759161409420/f5da4f38-7ed9-49c4-b39f-fdefa29ec147.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Originally published on</em> <a target="_blank" href="https://masudio.blogspot.com/"><em>Blogger</em></a><em>. This post compares two major ML platform approaches from industry leaders.</em></p>
<h2 id="heading-michelangelo">Michelangelo</h2>
<h3 id="heading-pain-point">Pain point</h3>
<p>Without Michelangelo, each team at Uber that uses ML (that's all of them - every interaction with the Rides or Eats app involves ML) would need to build its own data pipelines, feature stores, training clusters, model storage, etc.</p>
<p>It would take each team copious amounts of time to maintain and improve their systems, and common patterns/best practices would be hard to learn.</p>
<p>In addition, the highest-priority use cases (business critical, e.g. rider/driver matching) would themselves need to ensure they have enough compute/storage/engineering resources to operate through outages, scale peaks, etc., which would result in organizational complexity and constant prioritization battles between managers/directors/etc.</p>
<h3 id="heading-solution">Solution</h3>
<p>Michelangelo provides a single platform that makes the most common and most business critical ML use cases simple and intuitive for builders to use, while still allowing self-serve extensibility for all other ML use cases.</p>
<p>It's built into 3 main parts:</p>
<ol>
<li><p>Control Plane - ML Engineers / Data Scientists / Applied Scientists interact with this layer to do their work. It's kind of like the frontend layer of the ML Platform</p>
</li>
<li><p>Offline - Model training, evaluation, tuning/autoML, batch inference, running large scale jobs</p>
</li>
<li><p>Online - live inference, production user interactions</p>
</li>
</ol>
<p>Most recently they've added many features to make LLM development easier, such as a Hugging Face integration that makes open-source LLMs accessible to use and fine-tune, and prompt engineering environments to iterate in.</p>
<p>With Michelangelo, running workloads on Ray/Spark and just getting an ML project off the ground is no longer a heavy lift. And maintaining an ML project is manageable for product teams.</p>
<h2 id="heading-metaflow">Metaflow</h2>
<h3 id="heading-pain-point-1">Pain Point</h3>
<p>After prototyping, product teams need to ship their ML projects to production. Doing so can be very time consuming because of the variety of systems each project needs to integrate with in order to ship to users.</p>
<p>Product teams and ML engineers already have enough technologies they need to stay up-to-date on - adding in all the Netflix production dependencies and integrations required to ship a project to prod is overwhelming and a waste of ML engineer mindshare, when that could be handled and managed for them centrally.</p>
<p>There are a few key types of systems that need to be deployed to:</p>
<ul>
<li><p>Cached batch inference-style data/KV API's</p>
</li>
<li><p>GPU-backed live inference APIs</p>
</li>
</ul>
<p>And there are also systems that are not live or user-facing (user could be internal creatives or actual Netflix users) but need to be integrated with to allow for engineering progress:</p>
<ul>
<li><p>workflow/compute orchestration layer</p>
</li>
<li><p>Knowledge graph</p>
</li>
<li><p>Explainer infra</p>
</li>
</ul>
<h3 id="heading-solution-1">Solution</h3>
<p>Metaflow provides a user-friendly API and integrates with all of the most important systems on the path from ML idea to user-facing product/feature. It allows for extensions to be written by practitioners, and is an open-source project too.</p>
<p>Metaflow integrates with:</p>
<ul>
<li><p>(Fast) Data - there's a software layer on top of the main data lake (S3 Iceberg tables) that finds and pulls the correct data, and another layer (Apache Arrow) that efficiently (zero-copy) converts the data to streamable frames and allows user code to process it</p>
</li>
<li><p>Compute - e.g. there's a layer that gives metadata about a model and the env it was trained in, so that explainer models can be built</p>
</li>
<li><p>Orchestration - it allows 'flows' to be triggered in an event-driven style, so any user code can 'trigger' a Metaflow 'flow', and any 'flow' can trigger another 'flow'</p>
</li>
<li><p>Production - it allows various paths for deploying to production</p>
</li>
</ul>
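<p>The event-driven orchestration idea - any event can trigger a flow, and a running flow can trigger others - can be sketched in a few lines of plain Python. The names and structure below are my own illustration, not Metaflow's actual API:</p>

```python
# Toy event bus: flows subscribe to event names, and a running flow can
# publish events that trigger downstream flows - the chaining pattern
# Metaflow's orchestration layer enables. All names are illustrative.

from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)  # event name -> flows to run
        self.runs = []                        # record of executed flows

    def on(self, event, flow):
        self.subscribers[event].append(flow)

    def publish(self, event):
        for flow in self.subscribers[event]:
            self.runs.append(flow.__name__)
            flow(self)  # each flow may publish follow-up events

def training_flow(bus):
    # ...train a model... then announce the new artifact
    bus.publish("model.trained")

def eval_flow(bus):
    # ...evaluate the freshly trained model...
    pass

bus = EventBus()
bus.on("data.updated", training_flow)   # new data triggers training
bus.on("model.trained", eval_flow)      # training triggers evaluation
bus.publish("data.updated")
print(bus.runs)  # ['training_flow', 'eval_flow']
```

<p>The payoff is decoupling: the training flow never needs to know which downstream flows depend on it.</p>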
<h2 id="heading-compare-amp-contrast-michelangelo-and-metaflow">Compare &amp; Contrast Michelangelo and Metaflow</h2>
<p>Metaflow integrates fast data streaming; Michelangelo doesn't. Metaflow focuses on common compute/data primitives, while Michelangelo goes higher up the stack to common ML tasks like autotuning, evaluation and a general frontend UX for ML engineers/DS/AS. Michelangelo solves for resource allocation/prioritization and capacity efficiency via sharing between different teams; Metaflow doesn't.</p>
<h3 id="heading-key-differences">Key Differences</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>Michelangelo</strong></td><td><strong>Metaflow</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Key differentiator</strong></td><td>Resource sharing, unified UI</td><td>data streaming, model explainers, event-driven</td></tr>
<tr>
<td><strong>Architecture</strong></td><td>Control plane, offline, online</td><td>storage(S3,etc.), data streaming, event-driven computation DAGs</td></tr>
<tr>
<td><strong>Fast data/last-mile data processing</strong></td><td>???</td><td>S3-iceberg,parquet, metaflow.Table to find/load, MetaflowDataFrame/Arrow to stream</td></tr>
<tr>
<td><strong>Compute</strong></td><td>K8s, custom CRD controller, Spark &amp; ray</td><td>titus(k8s), spark for ETL, dependency management (user-friendly layer on top of docker)</td></tr>
<tr>
<td><strong>Orchestration</strong></td><td>Resource sharing across teams</td><td>Maestro workflows (DAGs) - the Metaflow project backbone; event-driven architecture</td></tr>
<tr>
<td><strong>Model hosting</strong></td><td>Feature transformation graphs bundled with model graphs for deployment (improves train/serve skew); Gen AI Gateway?</td><td>Metaflow hosting - models/artifacts from Metaflow deployed here; autoscaling, ops/observability</td></tr>
<tr>
<td><strong>Cloud</strong></td><td>OCI, GCP</td><td>Probably AWS</td></tr>
<tr>
<td><strong>Scale</strong></td><td>5K GPUs, 400 projects, 20K training jobs/mo; 5K models in prod, 10M QPS peak</td><td>??? GPUs, hundreds of projects</td></tr>
<tr>
<td><strong>UX</strong></td><td>MA Studio - 1 unified tool, Gen AI Gateway (newer); user submits jobs</td><td>Human-friendly APIs; user defines DAGs and triggers them with events</td></tr>
<tr>
<td><strong>Governance</strong></td><td>For models only: auditing, policy guardrails, PII redaction (does data-level governance live elsewhere?)</td><td>???</td></tr>
</tbody>
</table>
</div>
<h3 id="heading-key-similarities">Key Similarities</h3>
<ul>
<li><p>Consolidation of engineering effort for running ML jobs (platform)</p>
</li>
<li><p>User-friendliness</p>
<ul>
<li><p>Michelangelo uses MA Studio UI</p>
</li>
<li><p>Metaflow uses a human-friendly API</p>
</li>
</ul>
</li>
<li><p>Some overlap in 'primitives' offered (compute, data, workflows)</p>
</li>
</ul>
<h3 id="heading-summary">Summary</h3>
<p>Uber's Michelangelo and Netflix's Metaflow illustrate two viable, yet opposing, theories of ML platforms: <strong>unified ML experience</strong> vs. <strong>pluggable compute/data primitives</strong>. Here's where they are similar, where they contrast and where they both falter.</p>
<p>A consolidated ML platform is something both systems agree is necessary and valuable. Engineering effort required to…</p>
<ul>
<li><p>setup access to compute and data</p>
</li>
<li><p>request/provision/allocate compute resources</p>
</li>
<li><p>Orchestrate workloads</p>
</li>
<li><p>Allow human-friendly observability of job state / progress / history</p>
</li>
</ul>
<p>…is non-trivial and having individual teams do this each on their own is wasteful. Having a platform is no longer a differentiator for an organization's ML teams, but a requirement at a certain scale.</p>
<p>But the devil is in the details - each organization is different, and these 2 systems have evolved to prioritize serving other common ML features/concepts differently. Michelangelo excels at higher-level abstractions such as a <strong>unified UI and tracking/sharing idle compute resources (GPUs)</strong> across teams. Metaflow delivers on more robust distributed systems primitives such as its data <strong>streaming framework and an event-driven architecture</strong>. Metaflow also provides support for model explainers, a key use case for Netflix.</p>
<p>The one area neither of these systems (or at least the blog posts about them) touches on is <strong>data governance</strong>. Michelangelo has features for model governance, but it appears these features were added later and may not pertain to the data used to train those models. Metaflow has no mention of security, policy or audit trails. Governance can feel boring to many, but in any organization large enough to have an ML Platform, it's likely an important topic and it's a shame the systems don't go deeper on it.</p>
<p>Still, Michelangelo and Metaflow are excellent examples of ML Platforms at large organizations.</p>
<h3 id="heading-further-questions">Further Questions</h3>
<ul>
<li><strong>Feature store parity</strong> – Michelangelo's Palette is front-and-center; Metaflow leans on Iceberg + Fast Data. Do they solve the same latency/ownership pain, or is one focused on <em>engineering reuse</em> and the other on <em>developer velocity</em>?</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[ChatGPT - How Long Till They Realize I'm a Robot?]]></title><description><![CDATA[I tried it first on December 2nd...
...and slowly the meaning of it started to sink in. It's January 1st and as the new year begins, my future has never felt so hazy.
It helps me write code.
At my new company I'm writing golang, which is new for me, ...]]></description><link>https://www.masudio.com/chatgpt-how-long-till-they-realize-im-a-robot</link><guid isPermaLink="true">https://www.masudio.com/chatgpt-how-long-till-they-realize-im-a-robot</guid><category><![CDATA[chatgpt, ai, software engineering, github copilot]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sun, 01 Jan 2023 20:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759161594623/d19fa650-5e87-417a-90f0-3bb9b89ea051.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-i-tried-it-first-on-december-2nd">I tried it first on December 2nd...</h2>
<p>...and slowly the meaning of it started to sink in. It's January 1st and as the new year begins, my future has never felt so hazy.</p>
<h3 id="heading-it-helps-me-write-code">It helps me write code.</h3>
<p>At my new company I'm writing golang, which is new for me, and one day on a whim I think "hmmm maybe ChatGPT will give me some ideas about the library I need to use." Lo-and-behold it knew the library. It wrote example code. It explained each section in just enough detail.</p>
<h3 id="heading-im-excitedit-assists-my-users">I'm excited....It assists my users.</h3>
<p>I got a question about Dockerfiles in my team's on-call channel. "Hmmm I don't know the answer to this either"....ChatGPT did. It knew the commands to run. It knew details of how it worked. It explained it better and faster than I could have.</p>
<h3 id="heading-now-im-nervousit-writes-my-code-for-me">Now I'm nervous....It writes my code for me.</h3>
<p>Now I'm hearing how great Github Copilot is - and it's built by OpenAI too...ok I guess I should give it a shot. I install it, and within minutes it's already helped me complete a coding task far faster than I would've done it alone.</p>
<h3 id="heading-ok-now-its-full-on-fearit-assists-my-oncall">Ok now it's full on fear....It assists my oncall.</h3>
<p>I get paged in the evening. I haven't handled a page on my own at this company yet - I know I need logs but don't know how to get them. ChatGPT does. It knows the commands. It knows how to filter the output. It knows how to aggregate it. It knows all the details. If it had the right permissions would it be able to diagnose the issue itself?</p>
<h3 id="heading-full-on-career-existential-crisisit-invades-my-conversations">Full on career existential crisis!...It invades my conversations.</h3>
<p>I see many friends and loved ones over the holidays. I can't help but talk to every one of them about it. My wife is annoyed - I can't blame her; by the 7th conversation about ChatGPT I'm annoyed by the sound of my own fear. But the thing is, most start out skeptical but become impressed after trying it themselves for some work-related function.</p>
<h3 id="heading-now-i-feel-vindicated-in-my-fear-and-determined-to-be-the-first-to-warn-everyonemy-friends-bring-me-down-to-earth">Now I feel vindicated in my fear and determined to be the first to warn everyone!...My friends bring me down to Earth.</h3>
<p>Finally I speak to some friends who are completely unimpressed. They lead teams in AI/ML work and have heard the questions about ChatGPT at all-hands over and over throughout December. "What are we doing about ChatGPT? Have we thought about using it? Is there a risk to our business?". To them it's just another tool that'll make us a bit more productive. Our real job is to figure out which problems to solve, which ChatGPT cannot do. Initially I feel angry that they aren't playing "The Americans" to my "Paul Revere". But after I sleep it off, their perspective starts to make sense.</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXC5Z6go-camZ8dwi0zYhmoeG94Q4b8T0iaxMUDRUANDROpRiIsXG-UTJxlC18EvTtBryjZ6KImowcdgEypju6OKSlQHG0H2Tynjz0OavhyHxF8kPNSzwmpRqXrbZJPNu-hWgoasjl3DgQ3Me4Rx_DVyivqusSvx32ba00K2IPYA_Y2n5A2QzQlpIN/s1024/DALL%C2%B7E%202023-01-01%2014.02.43%20-%20paul%20revere%20warning%20the%20town%20chased%20by%20robots,%20digital%20art.png" alt="Paul Revere warning the town chased by robots, digital art" /></p>
<h3 id="heading-now-im-just-impatient">Now I'm just impatient.</h3>
<p>Change is here - ChatGPT, GitHub Copilot and other AI tools are useful products that represent a big step forward for AI. But the perspective I lose when my head is in the clouds is that the impact it will have IS NOT clear. It could be a sign of advances to come, or it could be just another tool in the white-collar professional's tool belt. And the best way to be right about anything regarding the future of tech is...slowly.</p>
]]></content:encoded></item><item><title><![CDATA[Random Replication Leads to Definitive Data Loss]]></title><description><![CDATA[Lazy non-determinism
Disciplined determinism
In the world of distributed systems, there is a common pattern of replication which is often used to prevent data loss (in reality, we mean data unavailability since usually data gets backed up to disk and...]]></description><link>https://www.masudio.com/random-replication-leads-to-definitive-data-loss-1</link><guid isPermaLink="true">https://www.masudio.com/random-replication-leads-to-definitive-data-loss-1</guid><category><![CDATA[distributed systems, data replication, data loss, storage systems]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sat, 09 Jan 2021 20:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/M5tzZtFCOfs/upload/9dd42e1f0ce6f17e83024c7b9acc7b25.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lazy non-determinism</p>
<p>Disciplined determinism</p>
<p>In the world of distributed systems, there is a common pattern of replication which is often used to prevent data loss (in reality, we mean data unavailability since usually data gets backed up to disk and disks almost never completely fail) in storage systems.</p>
<p>The pattern goes like this:</p>
<ol>
<li><p>Store a data chunk A on node 1</p>
</li>
<li><p>Replicate chunk A to nodes 2 and 3 so that you can lose 2 of those nodes and still not lose data</p>
</li>
<li><p>Repeat for all your data chunks, randomly choosing the node for each replica among all N nodes in your system</p>
</li>
</ol>
<p>If you do this, you can lose any 2 nodes and be sure not to lose any data.</p>
<p>But as systems grow in size, node failure becomes more and more frequent in absolute terms.</p>
<p>If you have a cluster of 10000 nodes, there is a good chance that at any moment more than 2 nodes are failing.</p>
<p>But you can expect to have only a small percentage of failures - say 1% - and that should protect you, since it's very unlikely that the 1% of nodes that fail at the same time contain all 3 replicas for any chunk of data.....right?</p>
<p>WRONG!</p>
<p>Yes, it's true that this is how replication is implemented in many popular storage systems such as HDFS and GFS - but the truth is that for large clusters, the probability of losing some data approaches 100% over the course of a year.</p>
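<p>A quick back-of-the-envelope calculation shows why. The cluster numbers below are made up for illustration:</p>

```python
# Back-of-the-envelope: with random 3-way replication, a specific chunk
# is lost only if all 3 of its replica nodes land in the failed set -
# astronomically unlikely. But with millions of chunks, *some* chunk
# almost certainly does. Numbers are illustrative.

from math import comb

nodes = 10_000          # cluster size
failed = 100            # 1% concurrent failures
chunks = 10_000_000     # total data chunks in the cluster

# P(one specific chunk has all 3 replicas inside the failed set)
p_chunk_lost = comb(failed, 3) / comb(nodes, 3)

# P(at least one of the many chunks is lost), treating chunks as
# independently, uniformly placed
p_any_loss = 1 - (1 - p_chunk_lost) ** chunks

print(f"P(specific chunk lost) = {p_chunk_lost:.2e}")
print(f"P(any chunk lost)      = {p_any_loss:.4f}")
```

<p>For these numbers the per-chunk probability is under one in a million, yet the cluster-wide probability of losing something comes out above 99.9% for a single failure event. The disciplined alternative is to restrict placement to a small, deterministic set of replica groups so that failures overlap far fewer of them.</p>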
]]></content:encoded></item><item><title><![CDATA[Architectural Characteristics - Transcending Requirements]]></title><description><![CDATA[Building a system means meeting a set of requirements dictated by the customer.
But the customer isn't always going to translate what they want into engineering terms.
Even if you say please, they might not even know how.
If they ask for an online or...]]></description><link>https://www.masudio.com/architectural-characteristics-transcending-requirements</link><guid isPermaLink="true">https://www.masudio.com/architectural-characteristics-transcending-requirements</guid><category><![CDATA[software architecture, software engineering, career, developer]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Mon, 19 Oct 2020 19:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759161787954/4f44c879-ce8a-488d-afda-927d0135a781.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Building a system means meeting a set of requirements dictated by the customer.</p>
<p>But the customer isn't always going to translate what they want into engineering terms.</p>
<p>Even if you say please, they might not even know how.</p>
<p>If they ask for an online ordering system, they probably won't specify that it needs to be available 24/7 and auditable for tax purposes.</p>
<p>Yet these aspects could mean the difference between project success and failure.</p>
<p>Those unsaid aspects are called anything from 'nonfunctional requirements' to 'Architectural Characteristics'.</p>
<h2 id="heading-aspects-of-architectural-characteristics">Aspects of Architectural Characteristics</h2>
<p>In their book 'Fundamentals of Software Architecture', Ford and Richards define Architectural Characteristics to have 3 criteria:</p>
<ul>
<li><p>Specifies a nondomain design consideration</p>
</li>
<li><p>Influences some structural aspect of the design</p>
</li>
<li><p>Is critical or important to application success</p>
</li>
</ul>
<p>Let's dive into these bullets to better understand them.</p>
<h3 id="heading-specifies-a-nondomain-design-consideration">Specifies a nondomain design consideration</h3>
<p>If you're building an ordering system, then order numbers, customers, items and prices are part of the domain.</p>
<p>Reliability and auditability are not, yet they can be a critical part of the system design - that's what makes them architectural characteristics.</p>
<h3 id="heading-influences-some-structural-aspect-of-the-design">Influences some structural aspect of the design</h3>
<p>Let's stick with our original example - reliability and auditability.</p>
<p>Reliability could mean having redundancy to ensure that failures have fallbacks and auditability can mean storing data for some number of years.</p>
<p>So these considerations can significantly change the structure of the system.</p>
<h3 id="heading-is-critical-or-important-to-application-success">Is critical or important to application success</h3>
<p>If the system goes down, our customer will lose their customer and they may not come back.</p>
<p>And if the system fails to store tax data, then in the event of a tax audit our customer will be found negligent and could incur significant penalties.</p>
<p>Opportunities for project failure abound, and that confirms that these aspects are architectural characteristics in our example.</p>
<h2 id="heading-summary">Summary</h2>
<p>We've come up with a toy example, but hopefully it illustrates the concept at hand.</p>
<p>It's important to identify only the most critical architectural characteristics for your system, because there are far more candidates than you could incorporate without adding unnecessary complexity.</p>
<p>A best practice is to work with your customer to identify the top 3 most important ones.</p>
<p>Any more and your system will become more complex than it's worth.</p>
<p>You can categorize architectural characteristics into 2 columns: operational and structural.</p>
<p>Operational ACs are things related to operating the system and keeping it running - things like availability and performance.</p>
<p>Structural ACs encompass anything related to how the system can be adapted for differing requirements or environments and for how engineers working on it maintain and improve it - configurability and maintainability for example.</p>
<p>It's a complex and nuanced concept and you'll need to wrestle with it for a while.</p>
<p>But as you do your work, try to distinguish design aspects that fit into the 'requirements' and ones that are 'Architectural Characteristics' using the criteria above.</p>
<p>The distinction will serve you as you're asked to focus more on higher and higher level design.</p>
]]></content:encoded></item><item><title><![CDATA[Laws of Software Architecture]]></title><description><![CDATA[As the discipline of software engineering matures, few things remain constant.
A few years ago, a large portion of the community thought that TDD was always the best methodology to use - after all, you move faster by being thorough and preventing bug...]]></description><link>https://www.masudio.com/laws-of-software-architecture</link><guid isPermaLink="true">https://www.masudio.com/laws-of-software-architecture</guid><category><![CDATA[software architecture, engineering, development]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sun, 18 Oct 2020 19:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759162072089/996ca28c-98fc-4eb3-bed3-6b02452b5837.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As the discipline of software engineering matures, few things remain constant.</p>
<p>A few years ago, a large portion of the community thought that TDD was always the best methodology to use - after all, you move faster by being thorough and preventing bugs.</p>
<p>Nowadays it's clear that unit tests do not ensure a successful project - when you're A/B testing new features or products and you're not even sure if your code will be around 3 months from now, it's better to churn out production code and test it in prod.</p>
<p>You can clean up the mess later, if the business value makes it worth it.</p>
<p>But despite the pace of change in software engineering, in software architecture there exist a set of 'laws' which don't change over time and which apply to all architectural endeavors.</p>
<p>I'd like to talk about 2 of them in this post as they are important ideas to keep in the back of your mind as your skills grow past coding, past design and into architecture.</p>
<h2 id="heading-laws-of-software-architecture">Laws of Software Architecture</h2>
<h3 id="heading-law-1-everything-in-software-architecture-is-a-trade-off">Law #1: Everything in software architecture is a trade-off</h3>
<p>There are no 'yes' or 'no' answers and no quippy one-sentence solutions in the world of software architecture.</p>
<p>If you find an exception to the rule, you're probably missing a lot of aspects of the problem OR you're not working on architecture.</p>
<p>When a problem is an architectural problem, the business requirements are only 1 aspect of the challenge at hand.</p>
<p>Other aspects require effort to discover and can dramatically alter the way you would frame the problem.</p>
<p>For example, say you're building a batch processing system, and the only requirements given to you are that the system have a queue that accepts new jobs even while other jobs are being processed, and that all submitted jobs complete as quickly as possible.</p>
<p>You might focus on the performance aspect and say 'we'll optimize the number of jobs running asynchronously to ensure we get the best throughput'.</p>
<p>Seems simple - but there's so much more missing.</p>
<p>What if the system goes down and can't even queue new jobs?</p>
<p>Well that'd be bad, so let's add in reliability as a requirement.</p>
<p>Also, if the system did go down, would the existing jobs in the queue be lost?</p>
<p>Ok, now we'll have to ensure the system is recoverable too.</p>
<p>But depending on how we design for recoverability, we may hurt performance - if the enqueue/dequeue operations need to go to disk, it'll increase latency...so now we've discovered the first trade-off and we're finally doing some architecting.</p>
<p>There are pros and cons to any architectural decision and they must be surfaced in order to make sound, well-thought-out decisions.</p>
<p>If you don't have an explanation of the trade-offs and alternative options to an architectural decision, you're missing something.</p>
<h3 id="heading-law-2-why-is-more-important-than-how">Law #2: 'Why' is more important than 'How'</h3>
<p>Don't get me wrong - architects are expected to have a breadth of knowledge in different technologies/domains so that understanding 'how' to do something is never a bottleneck.</p>
<p>Architects can often look at the design of a system and quickly understand 'how' it works.</p>
<p>But that skill is useless at best and dangerous at worst without the wisdom to ask 'why' as a first course of action.</p>
<p>There are infinite ways to solve any architectural problem - but without clear and correct answers to 'why', your systems will overlook key architectural characteristics that are important for project success while focusing on others that have nothing to do with the problem domain.</p>
<p>It doesn't matter so much what you build - it matters that you can convince yourself (not to mention the stakeholders) that the reasons behind every decision are in line with the end goals of the project and reflective of the project's priorities.</p>
]]></content:encoded></item><item><title><![CDATA[Consensus: The Hard Kind]]></title><description><![CDATA[You're on a team, undoubtedly. You have been tasked with solving a customer problem and you have a design ready and waiting for review. One team member reviewed an early version and asked for some tweaks, but after an iteration they agreed it was the...]]></description><link>https://www.masudio.com/consensus-the-hard-kind</link><guid isPermaLink="true">https://www.masudio.com/consensus-the-hard-kind</guid><category><![CDATA[software engineering, team collaboration, consensus, design reviews]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Mon, 12 Oct 2020 19:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759162236425/6470693e-c608-45ef-938d-c11b0addb6e3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You're on a team, undoubtedly. You have been tasked with solving a customer problem and you have a design ready and waiting for review. One team member reviewed an early version and asked for some tweaks, but after an iteration they agreed it was the optimal path forward. You open it up to a wider audience for further review. But then another team member pipes up...</p>
<p>"This design doesn't work - it's just not possible with our current setup"</p>
<p>Well, damn! What do you want me to do? Do I need to cycle through every team member and get them to give their feedback, iterate and incorporate it and then present it to everyone? Even if I do that, by the time I get around the group, someone will disagree with the design again!</p>
<p>Calm down, breathe - this is what it feels like to learn. And what you've just experienced is what I like to call the 'Hard' version of Consensus.</p>
<p>Every team is different in how they deal with this consensus problem, but if the input of your teammates is important to your work, then your team will solve it with some version of what follows.</p>
<p><strong>First</strong>, gain experience in the domain to gather the constraints. You need to understand the problem domain deeply so that your first iteration of design starts somewhere reasonable.</p>
<p><strong>Second</strong>, propose a reasonable solution. With enough domain experience, you'll be able to anticipate what the feedback will be and correct for it where you can and defend it where you can't.</p>
<p><strong>Third</strong>, share independently to each team member who you know will have the most critical feedback. Record what they say and make it a dialogue - it's a conversation, not a dictation.</p>
<p><strong>Fourth</strong>, update the solution to incorporate the feedback - where there are conflicting constraints, get the relevant team members in the same room (or the same IM thread, or post, or whatever medium you collaborate on) and hash it out - the conflicts will become apparent to your teammates and with the right level of trust, you'll be able to brainstorm and agree on a solution. Repeat this step as many times as needed.</p>
<p><strong>Fifth</strong> and finally, 'present' the design to a wider audience. 'Presenting' could mean an in-person team meeting, or it could mean a document sent to the team email list or group - again, it depends on the medium your team uses to collaborate. Make sure the reasons for each decision were documented along the way - but don't explain every single one of them in the wider presentation. It's too much info for those seeing it for the first time. Instead, have those explanations ready when questions come along so that you can hand-hold your audience through the same discovery process you went through to come to the same conclusions. You may still discover new issues/changes necessary in this wider meeting via feedback, but if you've done steps 1 to 4, there should be enough support in the room to help defend against any large modifications.</p>
<p>This series of steps is simple and straightforward on paper, but it requires assets that one can only earn through experience: patience, EQ, the trust of teammates, crucial conversations and - oh, yeah - technical know-how. It's no wonder senior engineers are few and far between. But the best part of earning these skills is that they apply to so many other parts of life, unlike coding and system design.</p>
]]></content:encoded></item><item><title><![CDATA[Mentoring a Software Engineer]]></title><description><![CDATA[Whether it's about raising a small child or directing a full grown engineer towards her next promotion, mentoring requires a great deal of patience, honesty and self-awareness. One wrong step and you could trigger a shame spiral. Or just as bad, you ...]]></description><link>https://www.masudio.com/mentoring-a-software-engineer</link><guid isPermaLink="true">https://www.masudio.com/mentoring-a-software-engineer</guid><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sun, 04 Oct 2020 19:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759162482499/3765a537-b7d2-4ea4-b348-a794572b18fa.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Whether it's about raising a small child or directing a full grown engineer towards her next promotion, mentoring requires a great deal of patience, honesty and self-awareness. One wrong step and you could trigger a shame spiral. Or just as bad, you could push your mentee to leave the team for a place with better career and/or learning opportunities. There are clear rules of engagement to follow, but once you've mastered those, doing it right becomes an art more than a science, so bring your creative energies in order to drive excellence.</p>
<p>A common next step after becoming a competent contributor on a software team is to start helping others onboard. When a new teammate joins, having a mentor can accelerate progress by showing them the next steps in their career path rather than letting them hack it out on their own.</p>
<p>But being successful as a mentor can mean many things - the KPIs for mentoring are not as clear as those for technical projects. You need the trust of your manager, and to maintain it you need to communicate often about how your mentee is doing and develop the skill of introspection. There are 4 important rules to mentoring that you can use to keep yourself on track and assess how you're doing.</p>
<h2 id="heading-ownership">Ownership</h2>
<p>Own your mistakes in guidance and drive ownership in your mentee about their mistakes in execution. You're not going to do everything perfectly, even if your mentee expects you to. Oftentimes an overconfident new engineer will have an expectation that any mistake you make in guidance will significantly delay their career. Other new engineers may be too afraid to bring up a mistake you made for fear of repercussions. Either way, it is your responsibility and privilege to learn to own up to your own mistakes without attributing infinite negative impact to them.</p>
<p>At the same time, you must also be clear when your mentee has made a mistake and ensure they do what's necessary to either fix it or ensure it doesn't happen again. If you start to take too much responsibility and attribute every one of their failings to a mistake in guidance, you will actually hinder their growth by making them feel dependent as if they have no agency. Making it clear that a mistake they made is on them is one of the best things you can do for your mentee.</p>
<p>It's a fine line between taking responsibility and assigning it, but over time you'll see more clearly where that line needs to be.</p>
<h2 id="heading-raise-the-bar-dont-lower-it">Raise the bar, don't lower it</h2>
<p>When you joined and ramped up on your team, there was less accumulated knowledge than there is now - which means you should already be expecting more from your mentee than you would've expected from yourself when you joined. Some mentors will allow themselves to believe they are special and that no one else could learn as fast as they did. If you do this, you'll set a bar that is too low for your mentee instead of allowing them to show you what their limits are.</p>
<p>The caveat is that you need to be specific about how you're providing them guidance and knowledge that is above and beyond what you received when you started. If you do this well, you'll see them excel and sooner rather than later you'll have a productive member of the team who you can delegate work to and trust they'll do what's necessary to get it done.</p>
<p>The bar for performance at any organization is always moving, and that means either you're lowering the bar or raising it - there's no in-between.</p>
<h2 id="heading-set-boundaries">Set boundaries</h2>
<p>Mentoring is not a full-time job - you still need to get your own work done and drive your own career forward in other ways. But your mentee will probably take as much of your time as you're willing to give. So you need to be clear about when you're able to offer help and when you're not. If this is your first experience mentoring, you might not be used to saying no to a meeting request or delaying a reply to an instant message so that you don't get context-switched. But it's important to do because it gives you time and focus to be productive and it drives independence in your mentee. If they don't know the answer to a question, they will soon find that it's faster for them to figure out how to answer it themselves - it could be looking through the company wiki or digging through code. These skills are essential too.</p>
<h2 id="heading-check-yourself">Check yourself</h2>
<p>Do you feel threatened by your mentee? Do you feel resentment? Do you have a strong enough driver to ensure your mentee's success? You have to be introspective. In professional relationships it's easy to be put in a position where your mentee's success is not aligned with yours. Talk to your manager if you feel this and get clarity on what happens if your mentee is successful or not. The best thing you can do for yourself and for your mentee is to align your mentee's goals with your own goals. You'll feel energized by the work of mentoring and your mentee will trust you more as they see the value you're providing for them.</p>
<p>In a good mentor/mentee relationship, growth can be experienced by both parties. It's important to treat it as another important function of your role at the organization. Be warned, if you pass it off as just another time sink, your influence at the company will wane over time and you'll find yourself quite lonely as you churn out code (or whatever it is that you build). On the flip side, if you focus your attention on this skill and help your mentee and your organization grow, you'll become an integral leader and you'll be well-respected by your peers. A little personal attention goes a long way.</p>
]]></content:encoded></item><item><title><![CDATA[Grow Your Career, Be A Senior Engineer]]></title><description><![CDATA[My personal experience with navigating a career in software engineering has been dotted by fits and starts. 9 years into my professional life I can look back and see what worked, what didn't, and why. I want to share that knowledge so other young eng...]]></description><link>https://www.masudio.com/grow-your-career-be-a-senior-engineer</link><guid isPermaLink="true">https://www.masudio.com/grow-your-career-be-a-senior-engineer</guid><category><![CDATA[career advice]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[mentoring]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sat, 11 Jul 2020 19:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759162489494/2b9af6fe-55d1-48e7-a683-e9623cd151af.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>My personal experience with navigating a career in software engineering has been dotted by fits and starts. 9 years into my professional life I can look back and see what worked, what didn't, and why. I want to share that knowledge so other young engineers can do more with their first 9 years than I did. Plus it's just fun to reminisce and will help me visualize the next steps in my career.</p>
<p>To be clear, the north star goal in my career is and has always been to learn and grow towards more positive impact on those I serve. If you have the same or a similar goal, this article will be useful.</p>
<p>I wrote up a simple formula which I think can help any aspiring engineers out there. It can be used as a template for a career plan. When you're thinking about how to get to the next level, make sure you're putting equal emphasis on all these categories.</p>
<p>1.) ACTION from you, despite imperfect information</p>
<p>2.) COWORKERS whom you respect</p>
<p>3.) MENTOR(s) invested in your growth and success</p>
<p>1 is the only absolute essential in this formula. Nothing happens without it. 2 and 3 will mean nothing without 1, but the presence of 2 and 3 will magnify 1. 3 is easier to find if you already have 2.</p>
<h2 id="heading-action">Action</h2>
<p>Action entails the time and effort you put in to improving the skills you use to do your job. Action must happen at work, as you do your job every day, and also outside of work when you 'Sharpen the Saw'. If you receive feedback from your manager that you need to do a better job to communicate your designs, your Action could be to do a presentation on them during a team meeting. If you feel you need to improve your C++ skills, your Action might be to join a weekend C++ class.</p>
<p>Action is the category you'll spend the most effort in - this category requires nothing else, but with information/feedback from high quality coworkers and mentors, the Actions you choose will be better targeted and personalized for your own growth. There is no substitute for Action - even with the best coworkers and mentors, you will get nowhere without it.</p>
<h2 id="heading-coworkers">Coworkers</h2>
<p>You're the average of the 5 people you spend the most time with. If you're like most people, those 5 are likely to include at least a couple of Coworkers. Coworkers shape the way you think about problems you're solving. They influence how you frame the questions you ask, before you've even thought about asking them. Quality coworkers will get to know you, make you feel welcome and then, most importantly, give you clear, actionable feedback. This feedback becomes more and more valuable over time as you start to see patterns in your own behavior. Over time, your strengths become clearer and clearer and your weaknesses too.</p>
<p>This category isn't something you 'do'. It's something you assess over time, and decide to change if/when you need to. By staying on a team where you don't feel growth and where you're not becoming more valuable to your customers (the people you serve) over time, you are sacrificing the multiplicative effect that good coworkers can have on your Actions over time. Do yourself a favor and assess whether you're on the right team every 6 months or so.</p>
<h2 id="heading-mentors">Mentors</h2>
<p>Your coworkers give real-time feedback, but because they work with you every day, often on the same goals, they aren't strongly incentivized to help you grow in your career. That's where Mentors come in. Mentors help you build the muscle of looking at your career from a wider view and asking larger questions. They can also connect you with opportunities outside what you currently have. A Mentor must be someone you greatly respect because they need to be able to influence you to make changes and take action while at the same time making you feel optimistic and excited about your career. You also need to find someone with more experience and a career path similar to what you're hoping to achieve. If you do that, you'll have insights from your future self and you'll be able to make progress toward becoming that person sooner.</p>
<p>A lot of times, Mentorships are formed organically - perhaps from a former coworker. This may be because you've already built rapport with the person, but it also might be because they understand the challenges you face and know how to help. In any case, Mentors can be very hard to find, so don't worry if you rarely get their time - the bits you get can still be quite valuable.</p>
<p>This category is another one that you don't 'do', but instead you'll have to be on the lookout for good mentors and put yourself in settings where you're more likely to meet them - conventions, meetups, even events your company holds for networking. It's not that you put a great deal of time into it, but instead just maintain good relationships with those you meet who have aspects of a career you want to emulate.</p>
]]></content:encoded></item><item><title><![CDATA[Why are Distributed Systems So Hard?]]></title><description><![CDATA[Isn't it true you can just write deterministic code and if you do it right and work to fix all the bugs, eventually you'll have a system that never does the wrong thing?
If that's not the case then why not?
Computers are deterministic - they're predi...]]></description><link>https://www.masudio.com/why-are-distributed-systems-so-hard</link><guid isPermaLink="true">https://www.masudio.com/why-are-distributed-systems-so-hard</guid><category><![CDATA[distributed systems, software architecture, system design, computer science]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sat, 01 Feb 2020 20:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759162750249/e8ee6f74-cf55-440a-aaee-784fce3e7112.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Isn't it true you can just write deterministic code and if you do it right and work to fix all the bugs, eventually you'll have a system that never does the wrong thing?</p>
<p>If that's not the case then why not?</p>
<p>Computers are deterministic - they're predictable and they only ever do exactly what you ask them to do and nothing more...right?</p>
<p>It's complicated.</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvzXOwd3tlP6qGkfBMa16EX53AEm_TdiU2jf6MyYjIjI1V1RWKM9Zj0J4RfDcxWy3KCj7ZF87NAPtwpFr5p-2OLB-TDmRqh6jF88qzVVvHqjKKbzW4VZ1IIjCBYqARZbNg7Oy1E3VSAd0/s1600/man-shrug-clueless-ss-1920-800x450.jpg" alt="Man shrugging" /></p>
<p>On a single node (computer) most failures mean the entire node completely stops working - for example the power supply fails, the disk dies, or the motherboard gets fried.</p>
<p>These types of failures are easy to detect because the node won't be in a 'partially failed' state where sometimes it performs the functions it's asked and sometimes it doesn't.</p>
<p>In a network, you have multiple nodes - possibly hundreds or thousands.</p>
<p>When one node fails, it is impossible to be 100% sure that node is completely down.</p>
<p>And if you can't be sure that a node won't come back to life, then you can't simply give its responsibilities away to another node.</p>
<p>In addition, if one node fails, for many important algorithms, you have to make sure all the other nodes are aware of the fact.</p>
<p>But why is it so hard to be sure?</p>
<p>Computers do a lot of weird things.</p>
<p>For example, programming languages use garbage collection to clear out items from memory that are no longer used.</p>
<p>This can sometimes be a long process, especially when there are lots of large items that need to be garbage collected.</p>
<p>But it's non-deterministic how long a garbage collection run will take.</p>
<p>When garbage collections happen, the process running on that node goes through a pause - a period of time when no other work gets done.</p>
<p>During this time the node can look as if it's down and out.</p>
<p>Since the node won't respond to requests during this time, an outside observer might conclude after some time that it's dead.</p>
<p>But after the garbage collection finishes, the process will come back to life and start responding to requests and doing useful work.</p>
<p>In addition, each node has its own system clock.</p>
<p>That clock can have bugs, or become slow or fast.</p>
<p>So when you're measuring how long it's been since a request to another node was sent, the measurement you get has the potential to be wrong.</p>
<p>Networks themselves can be finicky too.</p>
<p>Nodes that are only accessible through a single edge can easily be cut off if that edge goes down.</p>
<p>When that happens it's called a network partition.</p>
<p>The link will eventually be restored, but during the time it's down, it will be impossible to tell from one side the state of the nodes on the other side.</p>
<p>So you have to make guesses about the state of each node with varying levels of confidence.</p>
<p>You use timeouts, retries, locks and more complex consensus algorithms to manage the uncertainty about the state of each and every node in the network at any given point in time.</p>
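<p>The timeout-and-retry idea can be sketched in a few lines of Python (a toy failure detector with hypothetical names, not any particular library's API). Notice the asymmetry in the verdicts: 'alive' is a fact, while 'presumed dead' is only ever a guess:</p>

```python
def probe(ping, retries=3, base_timeout=0.5, backoff=2.0):
    """Classify a node using timeouts and retries.

    `ping` is any callable taking a timeout in seconds and returning
    True if the node responded within it. A node that's mid-GC-pause,
    behind a network partition, or just slow looks exactly like a dead
    one from here - hence 'presumed dead', never 'dead'.
    """
    timeout = base_timeout
    for _ in range(retries):
        if ping(timeout):
            return "alive"
        timeout *= backoff  # be more patient on each retry
    return "presumed dead"  # ...but it may still come back to life
```

<p>Real consensus algorithms build much more machinery on top of this, but they can't escape the underlying uncertainty - they can only manage it.</p>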
<p><strong>In short, the reason distributed systems are hard is because of non-determinism caused by process pauses, requests with no response and out-of-sync system clocks</strong>.</p>
]]></content:encoded></item><item><title><![CDATA[Cluster Management at LinkedIn]]></title><description><![CDATA[In 2014 LinkedIn released a cluster management solution called Helix. Helix solves some problems that arise when a system scales to be too large to manage even on just a few hosts. A successful system will start to go through a few transition states ...]]></description><link>https://www.masudio.com/cluster-management-at-linkedin</link><guid isPermaLink="true">https://www.masudio.com/cluster-management-at-linkedin</guid><category><![CDATA[distributed systems, cluster management, linkedin, apache helix]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Wed, 22 Jan 2020 20:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759162720167/f6e4b066-8836-4c82-837d-9f3377b894ad.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 2014 LinkedIn released a cluster management solution called Helix. Helix solves some problems that arise when a system scales to be too large to manage even on just a few hosts. A successful system will start to go through a few transition states that, when large enough, will become frequent enough to require an automated solution.</p>
<p>First, your system will become too large to host on a single machine. So now you need to shard it.</p>
<p>Then your system will either have hosts fail once in a while, or some shards might start getting too big or taking too much load. So then you can start using replication to solve for that.</p>
<p>As your cluster grows, the average size of shards also grows - sometimes you'll have to split shards because they become too big, or redistribute them more broadly. So now you need something to allow for that.</p>
<p>Partitioning/sharding, fault tolerance and scalability - these are the higher level concepts just described, and the problems Helix solves for. If you can solve these problems, there's a good chance your next bottleneck will be TCO (Total Cost of Ownership). With more hosts being used, you'll need more efficient resource utilization - and you can get that with multitenancy. Multitenancy means hosting multiple tenants on shared infrastructure - you might have one tenant that's high in CPU consumption but low in memory usage hosted on the same machine as another that's high in memory usage and low in CPU. This is cheaper than giving those 2 tenants 2 different machines with the same resources. If you have the ability to easily redistribute load, as Helix does, then you also have multitenancy, so that problem can be solved with Helix as well.</p>
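<p>A toy first-fit sketch in Python (illustrative only - this is not how Helix itself places replicas) shows why complementary resource profiles cut the host count:</p>

```python
def pack_tenants(tenants, cpu_cap=1.0, mem_cap=1.0):
    """First-fit packing of (name, cpu, mem) tenants onto hosts.

    A CPU-heavy tenant and a memory-heavy tenant fit together on one
    host, which is the TCO win multitenancy is after.
    """
    hosts = []  # each host: {"cpu": used, "mem": used, "tenants": [...]}
    for name, cpu, mem in tenants:
        for h in hosts:
            if h["cpu"] + cpu <= cpu_cap and h["mem"] + mem <= mem_cap:
                h["cpu"] += cpu
                h["mem"] += mem
                h["tenants"].append(name)
                break
        else:
            # no existing host has room - pay for a new one
            hosts.append({"cpu": cpu, "mem": mem, "tenants": [name]})
    return hosts
```

<p>Here a CPU hog (0.8 CPU, 0.2 memory) and a memory hog (0.1 CPU, 0.7 memory) share one host instead of occupying two.</p>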
<p>Although Helix does a lot, there are other problems it doesn't solve. If you have load fluctuations on your system such that different shards/replicas will have their load increase or decrease at different times of day, then you'll want those replicas to move around to different hosts so that load is always well distributed. But in order to have that, you'd need a system that monitors load metrics on each server in real time and is able to move shards around on the fly when that load fluctuates across the cluster. That means dynamic load balancing, and as of today, Helix does not support this.</p>
<p>Despite its shortcomings, for a system built 6+ years ago, Helix has stood up well against the test of time. Plus it's open source, so it gets bonus points for contributing concrete value to the world of software engineering.</p>
]]></content:encoded></item><item><title><![CDATA[The Curious Case of the Document Database]]></title><description><![CDATA[Let's talk about an oft-overlooked NoSQL database type. It's got all the best parts of a k-v store and allows limited SQL-like query-ability too! They're called document databases and they're all the rage.
A document database stores objects that can ...]]></description><link>https://www.masudio.com/the-curious-case-of-the-document-database</link><guid isPermaLink="true">https://www.masudio.com/the-curious-case-of-the-document-database</guid><category><![CDATA[nosql, mongodb, database, document-database]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sat, 21 Dec 2019 20:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759162901964/9d99a081-354c-4875-885a-3731b4f2a8c2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let's talk about an oft-overlooked NoSQL database type. It's got all the best parts of a k-v store and allows limited SQL-like query-ability too! They're called document databases and they're all the rage.</p>
<p>A document database stores objects that can be serialized to JSON or some similar serialization format. These JSON 'documents' are keyed by an ID, similar to how k-v stores work. When you want to fetch an entire document, all you need is the key for it. But the magic of document databases allows you to fetch only pieces of a document and also to fetch data from multiple documents using selection criteria that mirrors basic SQL query functionality.</p>
<p>This is all made possible by the tree-like structure that documents in a doc DB must conform to. JSON data can contain keyed fields, nested structures, and lists. Using this structure a doc DB can extract specific pieces of a doc so that the entire doc doesn't have to be returned and parsed in the application layer.</p>
<p>Query-ability, the other interesting feature of doc DBs, is a step above k-v stores, though not as powerful as a SQL DB. You can filter your results based on different fields in documents since the documents have structure to them. There's no strict schema, so some docs will have fields that others don't, but you can still run the same queries on those docs. To improve performance, you can use high-selectivity fields (fields where there are many different possible values among docs) as query filters and add indexes to limit the computation further.</p>
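<p>Here's a toy sketch of the idea in Python (it mimics no particular database's API): because each doc is a tree, the store can project out sub-fields and filter across docs, even when some docs lack a field entirely:</p>

```python
def get_path(doc, path):
    """Extract a nested field like 'address.city' from a JSON-like doc."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None  # schemaless: docs may simply lack the field
        doc = doc[key]
    return doc

def find(docs, filters, projection=None):
    """Return docs matching every filter, optionally projecting fields.

    `docs` is a dict of key -> document, as in a k-v store; `filters`
    maps dotted field paths to required values.
    """
    out = []
    for doc in docs.values():
        if all(get_path(doc, p) == v for p, v in filters.items()):
            if projection:
                out.append({p: get_path(doc, p) for p in projection})
            else:
                out.append(doc)
    return out
```

<p>A real doc DB does the same conceptual work server-side, with indexes to avoid scanning every document.</p>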
<p>Like other NoSQL database types, document databases also enforce no schema. 2 different keys can point to docs with entirely different sets of fields. This means you won't have to do expensive database migrations each time a new field is needed.</p>
<p>MongoDB is the most widely known open source document DB and has all the features listed above as well as many more that will allow you to scale both reads and writes easily. Check it out next chance you get and you'll certainly find a use for it in one of your projects.</p>
]]></content:encoded></item><item><title><![CDATA[Consistency in Redis]]></title><description><![CDATA[Most uses of Redis will focus more on latency and availability rather than consistency - that's because at its core, Redis is essentially a cache. Generally speaking, you store things in Redis in memory and you update or read them extremely quickly. ...]]></description><link>https://www.masudio.com/consistency-in-redis</link><guid isPermaLink="true">https://www.masudio.com/consistency-in-redis</guid><category><![CDATA[redis, database, consistency, distributed systems]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Fri, 13 Dec 2019 20:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759378180391/3b637fc8-3ad7-44b5-aadc-378a3724f4ef.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most uses of Redis will focus more on latency and availability rather than consistency - that's because at its core, Redis is essentially a cache. Generally speaking, you store things in Redis in memory and you update or read them extremely quickly. You need to make sure that the cache is always available, so in most cases you'd only choose Redis if you're leaning towards an A class system (A for Availability) rather than a C class system (Consistency).</p>
<p>However, it's important to know that a replicated instance of Redis is capable of giving you different levels of consistency up to and including read-after-write consistency - the kind of consistency that guarantees any read issued after a successful response to a write will see that write, even if the read goes to a different replica than the write did. What Redis can't give you is linearizability - the guarantee that the system behaves as if there were only a single copy of the data, with every operation appearing to take effect atomically at a single point in time. Redis doesn't offer distributed transactions out-of-the-box and therefore can't provide this even with a replicated instance. And if the instance is NOT replicated but is sharded, then linearizability can only be achieved for writes to the same machine.</p>
<p>The way to get read-after-write consistency with a replicated Redis distribution is to:</p>
<p>1.) use the WAIT command after each write, specifying all replicas - this means a write will not be reported as successful to the client until every replica has received it and acknowledged.</p>
<p>2.) enable AOF and set it to 'always' - the Append Only File means writes are appended to a log file on disk; 'always' means every single write is fsynced to that file before the command returns.</p>
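<p>As a sketch, the configuration above looks something like this (hypothetical values, assuming one master with two replicas):</p>

```
# redis.conf on the master
appendonly yes         # enable the Append Only File
appendfsync always     # fsync the AOF on every single write

# from the client, after each write command:
# block until both replicas acknowledge (2 replicas, 1000 ms timeout)
WAIT 2 1000
```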
<p>These settings trade availability almost entirely for as much consistency as Redis can offer - if any replica is down, NO writes can go through. Though reads could still succeed for some users. However, considering that you still can't get linearizability, there's only a narrow swath of applications that would be best served by this configuration in Redis.</p>
<p>Nonetheless, it's possible, and adds another degree of flexibility to Redis.</p>
]]></content:encoded></item><item><title><![CDATA[CAP Theorem Explained]]></title><description><![CDATA[When building large-scale software systems today, you have to make tradeoffs. You can't have an ACID compliant data store with infinite storage/throughput/connections that's always available in any part of the world with super low latency where clien...]]></description><link>https://www.masudio.com/cap-theorem-explained</link><guid isPermaLink="true">https://www.masudio.com/cap-theorem-explained</guid><category><![CDATA[distributed-systems, cap-theorem, consistency, availability]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sun, 08 Dec 2019 20:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759377975998/cbd4981e-17cb-4870-b690-5d1768d37661.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When building large-scale software systems today, you have to make tradeoffs. You can't have an ACID compliant data store with infinite storage/throughput/connections that's always available in any part of the world with super low latency where clients can read/write concurrently without any risk of inconsistencies that's free. If you could, the problem would be solved and our industry could go build spaceships at SpaceX or retire and make sourdough every Sunday.</p>
<p>Instead, we need to make tradeoffs. Does our product/system need ACID semantics? Is latency more important? Can we allow certain types of data inconsistencies for a short time in favor of availability? How much are we able to spend so that we don't have to sacrifice as much?</p>
<p>These are some questions that everyone building a large-scale software system has to grapple with in the design phase. A great way to begin your thinking is using CAP Theorem - or at least what it's slowly been crystallized into over the years.</p>
<p>CAP Theorem was first proposed in 2000 and proven a couple of years later. CAP stands for Consistency, Availability, Partition tolerance. Initially, it was thought that you had to choose 2 out of the 3 in a distributed system - and that was your tradeoff.</p>
<p>The initial definition of Availability was that if a node in the system is not considered to be down, then it must perform and respond to any read/write requests it receives. Partition tolerance meant that if a network fault occurred causing some nodes in the network to no longer be able to reach one another, though clients could still connect to at least 1 node, then the system would still work as expected. And Consistency meant that every read request receives the latest write and that writes are never lost.</p>
<p>Practically speaking (and NOT theoretically), Partition tolerance is a must-have. Any real-world network will have faults that will cause partitions and it will happen often - if the network is large enough, there is a network partition at any given moment. If the entire system stopped working because of a partition, a distributed system on a large enough scale would be down at all times.</p>
<p>So it's not really a choice of 2 out of 3 - Partition tolerance is a must, which leaves only a choice between Consistency and Availability. CP or AP - which one fits your use case?</p>
<p>Of course, like all the juiciest problems, it's not a black-and-white tradeoff. It's more of a spectrum. Your system might be able to handle some forms of inconsistency for short periods of time - if it's a news site and updates to an article take a few minutes to slowly propagate to each user-facing node in the system, it's probably tolerable as long as all the nodes get updated eventually.</p>
<p>You might also want to accept some unavailability because you can NEVER accept inconsistencies - if you're a replicated storage system for user data and you're using quorum writes and a majority of the replica servers are down, then you might choose to let writes fail until one of the servers comes back so that user data doesn't get corrupted - but in the meantime, reads will still work from the remaining live servers.</p>
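<p>The quorum scenario above can be sketched in a few lines (a toy model, not a real replication protocol): writes need acknowledgements from W replicas and reads consult R replicas, and reads are guaranteed to see the latest write exactly when R + W exceeds the replica count N, because the two sets must then overlap.</p>

```python
def can_write(n_alive: int, w: int) -> bool:
    """A quorum write succeeds only if at least w replicas are reachable."""
    return n_alive >= w

def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """Any r-replica read set must overlap any w-replica write set,
    which holds exactly when r + w > n."""
    return r + w > n

# 5 replicas with majority quorums (w = 3, r = 3):
print(reads_see_latest_write(5, 3, 3))  # True - quorums always overlap
print(can_write(2, 3))   # False - 3 of 5 replicas down, writes fail...
print(can_write(4, 3))   # True  - ...but resume once a replica returns
```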
<p>It's a spectrum - and this is the best way to start thinking about building a large-scale software system. To understand all the points on the spectrum, there's a lot of knowledge out there too - I'll write about the different types of consistency on a future post.</p>
<p>The CP and AP tradeoff is well documented in 2 important texts on the subject: Designing Data Intensive Applications by Martin Kleppmann and NoSQL Distilled by Pramod J. Sadalage and Martin Fowler.</p>
]]></content:encoded></item><item><title><![CDATA[XFN Development - What's it all About?]]></title><description><![CDATA[XFN (cross-functional) work is one of the challenges of a senior engineer in most tech companies.
Broadly, it means to interact with team members of a different team than your own.
Concretely, this can mean anything from aligning goals or gathering f...]]></description><link>https://www.masudio.com/xfn-development-whats-it-all-about</link><guid isPermaLink="true">https://www.masudio.com/xfn-development-whats-it-all-about</guid><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sun, 01 Dec 2019 20:00:00 GMT</pubDate><content:encoded><![CDATA[<p>XFN (cross-functional) work is one of the challenges of a senior engineer in most tech companies.</p>
<p>Broadly, it means working with members of teams other than your own.</p>
<p>Concretely, this can mean anything from aligning goals or gathering feedback from other teams to inform your roadmap, to pair programming to flesh out the design of a new interface.</p>
<p>Your team wants you to make progress on the goals through XFN, but also to make the team look good (ie competent, smart, motivated) to those who are judging.</p>
<p>In my organization, this type of work is reserved for more seasoned engineers - although you're interacting with others on system designs, a lot of it is not the type of stuff taught in CS classes.</p>
<p>It's about personal interactions.</p>
<h2 id="heading-building-relationships">Building Relationships</h2>
<p>When you first start talking to a member from a different team, it's important to ensure they feel that you're someone they want to work with.</p>
<p>You should be able to describe the system you own or are building on a whiteboard in a few minutes, and you should be able to understand, follow and ask questions as XFN team members are doing the same.</p>
<p>Besides that, you need to understand the goals of your own team so that you can make sure that the agreements that come out of your interactions align with the direction your team is going.</p>
<p>XFN teams will sometimes try to move work onto other teams in order to save the bandwidth of their own team.</p>
<p>Usually you're still working toward the same mission, but if you're not presenting the goals and constraints of your own team well enough, you might end up agreeing to take on 6 months of test debugging work in order to appease your XFN colleague.</p>
<h2 id="heading-clear-agreements">Clear Agreements</h2>
<p>XFN work means you have to think more like a lawyer - you need to make sure agreements about who is going to do what are clear between you and the other party.</p>
<p>It's essential, or else you'll end up missing your timelines because you thought they would do the work, and they thought you would.</p>
<p>Make the agreements clear - clear enough that you can understand them and describe them to your manager.</p>
<p>Then make sure to write it down and publish it so that your colleague and your manager can see it - in an email or post - somewhere you can reference back to in case anything happens.</p>
<h2 id="heading-regular-follow-up">Regular Follow-up</h2>
<p>Follow-up with your XFN colleagues and sync up regularly.</p>
<p>The workplace today leaves few excuses for not having reached out to a colleague soon enough - you can walk over to their desk, call them, send an IM, and in some workplaces you can post in their user group.</p>
<p>So use the tools you have to check in and sync up to make sure everyone is doing the work they are expected to get done.</p>
<h2 id="heading-managing-accountability">Managing Accountability</h2>
<p>Even if things are in writing and you follow-up, the fact is that different teams are held to account by different management chains.</p>
<p>If you get an agreement with an XFN team and they aren't following through, make sure you have a plan for what would happen next.</p>
<p>If a timeline for a project starts to go from green to yellow, communicate it to the XFN team member and ask what they're doing about it.</p>
<p>Tell your manager as well - they will want to know so that they can plan accordingly.</p>
<h2 id="heading-high-stakes-work">High Stakes Work</h2>
<p>XFN work is different from technical work like coding and system design.</p>
<p>It requires you to do a small set of very specific things with extremely high consequences if you fail to do any of them, and the only safeguard against those things slipping is you.</p>
<p>What would happen if you simply missed a monthly check-in and it turned out your XFN partner was falling behind?</p>
<p>You'd find out a month later and then have to push dependent projects back 2 months instead of 1 - that could mean a lot of people angry at you.</p>
<p>In contrast, if you're writing code or designing an API, you can rely on code reviews, design reviews or colleague feedback to make sure you don't miss anything important.</p>
<p>It's rare a week goes by without either someone on your team making sure your deliverables are moving forward, OR you yourself seeking feedback on what you've done so far.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>XFN work is less fun in the short term but far higher impact in the long run.</p>
<p>So try not to shove it aside as something boring and unworthy of your attention - if you screw it up, you could get your team in deep doo-doo.</p>
]]></content:encoded></item><item><title><![CDATA[Redis!...Huh? What ISN'T it good for?]]></title><description><![CDATA[Redis is an in-memory key-value data store that allows you store your actual data structures rather than having a mapping layer between your application and your storage.
Support exists for any data type you'd need including lists, sets and hashes/ma...]]></description><link>https://www.masudio.com/redishuh-what-isnt-it-good-for</link><guid isPermaLink="true">https://www.masudio.com/redishuh-what-isnt-it-good-for</guid><category><![CDATA[Redis]]></category><category><![CDATA[Databases]]></category><category><![CDATA[NoSQL]]></category><dc:creator><![CDATA[Masud Khan]]></dc:creator><pubDate>Sun, 24 Nov 2019 20:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Redis is an in-memory key-value data store that allows you store your actual data structures rather than having a mapping layer between your application and your storage.</p>
<p>Support exists for any data type you'd need including lists, sets and hashes/maps.</p>
<p>It's in-memory but also has options to push to disk - you can push to disk on every write with a huge performance cost, or at some regular interval.</p>
<p>Writes can be configured to happen via an append-only log, which makes them lightning fast. Pushing to disk once every second has performance comparable to never pushing to disk at all.</p>
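<p>A sketch of the relevant redis.conf directives (values here are just illustrative):</p>

```
appendonly yes          # log every write command to the Append Only File
appendfsync everysec    # fsync once per second: performance comparable to
                        # no fsync, at most ~1 second of writes lost on crash
# alternatives: always (fsync every write - the big performance cost)
#               no     (let the OS decide when to flush)
```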
<p>Redis supports replication in a few different ways.</p>
<p>By default it's asynchronous, but can be configured to be synchronous for safety.</p>
<p>Combined with append-only logging fsynced on every write, any write that returns success has been persisted to the replicas and to disk.</p>
<p>Redis Cluster allows automatic sharding and handling of many different failure scenarios, so if a small number of the hosts in your cluster are experiencing failures, the cluster as a whole can continue to operate.</p>
<p>When configured this way, if a Redis host is lost, no acknowledged write operations will be lost, since writes are replicated and written to disk before the write call responds with a success.</p>
<p>Redis works for so many different applications that it begs the question - <strong>what ISN'T it good for?</strong></p>
<h2 id="heading-large-amounts-of-data">Large amounts of Data</h2>
<p>Redis is an in-memory data store, which means your data has to fit in memory.</p>
<p>If the use case requires a lot of data and you don't have money for many machines with expensive RAM, then Redis might be the wrong choice.</p>
<p>You can make it work, but only with more cost and complexity in the form of more granularly sharded data.</p>
<p>Older versions of Redis could swap the values of rarely-used keys out to disk (the virtual memory feature), but that feature has since been removed.</p>
<p>Modern Redis keeps keys and values in memory, and once the configured memory limit is reached it either rejects writes or evicts keys according to a configurable policy (such as LRU).</p>
<p>So, fine - if access patterns of your data mean a few keys are accessed frequently and others are not, then maybe you can still make the case for using Redis.</p>
<p>If your keys are rarely accessed and/or non-latency-sensitive, you should consider using something different.</p>
<p>Redis is meant for use cases where you need high performance lookups and your dataset (at least the keys) can fit into memory.</p>
<h2 id="heading-relational-data">Relational Data</h2>
<p>If your data access patterns involve a lot of relations between keys, Redis will force you to make many network calls before you can get to the piece of data you want for any particular query.</p>
<p>It's not a graph database and it's not a replacement for SQL.</p>
<p>Key-value stores are strong in use cases where a single key can be used to get the exact piece(s) of data you want.</p>
<p>Use something like MySQL or PostgreSQL - or if your data looks like a graph with vertices and edges, use a graph database like neo4j.</p>
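<p>Here's a toy illustration of the round-trip problem (an in-memory dict stands in for Redis): resolving a two-hop relation takes one lookup per hop per record - chatter that a single SQL JOIN would handle server-side in one query.</p>

```python
# A dict standing in for a key-value store.
store = {
    "user:1:orders": ["order:7", "order:9"],
    "order:7": {"product": "product:3"},
    "order:9": {"product": "product:4"},
    "product:3": {"name": "keyboard"},
    "product:4": {"name": "mouse"},
}

# user -> orders -> products: 1 + 2 + 2 = 5 separate lookups,
# each of which would be a network round trip against real Redis.
names = []
for order_key in store["user:1:orders"]:            # lookup 1
    order = store[order_key]                        # lookups 2, 3
    names.append(store[order["product"]]["name"])   # lookups 4, 5

print(names)  # ['keyboard', 'mouse']
```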
<h2 id="heading-range-queries">Range Queries</h2>
<p>Redis has functionality to query ranges, but the performance falls short of dedicated solutions. If range queries are one of your key needs, a document database like MongoDB will give you better performance.</p>
<h2 id="heading-acid-guarantees">ACID guarantees</h2>
<p>Redis is NoSQL - one thing that means is that you don't get ACID guarantees.</p>
<p>If you're making updates to multiple keys in a cluster, they will not be transactional unless the keys map to the same hash slot (which you can force by giving them the same hash tag).</p>
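<p>How keys map to slots is documented in the Redis Cluster spec: CRC16 (XMODEM variant) of the key modulo 16384, using only the <code>{hash tag}</code> portion when one is present. A sketch of that computation:</p>

```python
def hash_slot(key: str) -> int:
    """Redis Cluster slot for a key: CRC16-XMODEM of the key
    (or of its {hash tag}, if present) modulo 16384."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:   # only a non-empty tag counts
            key = key[start + 1:end]
    crc = 0
    for byte in key.encode():
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc % 16384

# Keys sharing a hash tag land in the same slot, so a MULTI/EXEC
# transaction in a cluster can cover both of them:
print(hash_slot("{user:1}.name") == hash_slot("{user:1}.email"))  # True
```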
<p>So Redis can't do distributed transactions.</p>
<p>However, you could build a 2-phase commit on top of Redis and do it yourself.</p>
<p>It just wouldn't be strongly consistent no matter what you do.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Redis' strength lies in low-latency non-relational use cases where data consistency is a 2nd priority.</p>
<p>High availability is also a high priority for Redis and Redis Cluster improves on that model.</p>
<p>For what it does best, low latency, Redis is absolutely best in class for storage.</p>
<p>If, however, your use case prioritizes large amounts of data, relational data and/or ACID guarantees, don't go to war with your own architecture - steer clear of Redis.</p>
]]></content:encoded></item></channel></rss>