Breaking the quadratic barrier: modern alternatives to softmax attention
Large Language Models are great, but they have one notable drawback: they use softmax attention, which is computationally expensive for long inputs. In this article we will explore whether we can replace the softmax to achieve linear time complexity.
Image by Author (Created using Miro Board)
Attention Basics
I'm going to assume you already know about models like ChatGPT and Claude, and that you know how transformers work in these models. Attention is the backbone of such models. In a standard RNN, we encode all past states into a single hidden state and then use that hidden state, along with the new query, to produce our output. The clear drawback is that you can't store everything in one small hidden state. This is where attention helps: imagine that for each new query you could find the most relevant past data and use it to make your prediction. That is essentially what attention does.
The attention mechanism in transformers (the architecture behind most current language models) involves query, key, and value embeddings. It works by matching queries against keys to retrieve relevant values: for each query (Q), the model computes similarity scores with all available keys (K), then uses these scores to create a weighted combination of the corresponding values (V). This attention calculation can be expressed as:
Source: Image by Author
This mechanism enables the model to selectively retrieve and utilize information from its entire context when making predictions. We use softmax here because it converts raw similarity scores into normalized probabilities, acting similarly to a k-nearest-neighbor mechanism where higher attention weights are assigned to more relevant keys.
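To make the computation concrete, here is a minimal NumPy sketch of single-head, unmasked softmax attention. The function name and the 1/√d scaling follow the standard textbook formulation rather than anything specific to this article.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Q, K, V: (N, d) arrays for a sequence of length N and embedding dimension d
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (N, N) similarity matrix: the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted combination of values, (N, d)

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)  # (8, 4)
```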
Okay, now let's look at the computational cost of one attention layer:
Source: Image by Author
Softmax Drawback
From the above, we can see that we need to compute the softmax of an N×N matrix, so our computational cost is quadratic in sequence length. This is fine for shorter sequences, but it becomes extremely inefficient for long ones (N = 100k and above).
This gives us our motivation: can we reduce this computational cost? This is where linear attention comes in.
Linear Attention
Introduced by Katharopoulos et al., linear attention uses a clever trick: we replace the softmax exponential with a kernel function, expressed as a dot product of feature maps φ(x). Using the associative property of matrix multiplication, we can then rewrite the attention computation to be linear in sequence length. The image below illustrates this transformation:
Source: Image by Author
Katharopoulos et al. used elu(x) + 1 as φ(x), but any kernel feature map that can effectively approximate the exponential similarity can be used. The computational cost of the above can be written as:
Source: Image by Author
This eliminates the need to compute the full N×N attention matrix and reduces the complexity to O(Nd²), where d is the embedding dimension. When N ≫ d, which is usually the case for Large Language Models, this is effectively linear in sequence length.
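Here is a short NumPy sketch of the kernelized form with the elu(x) + 1 feature map. It is a simplified, single-head version for illustration; the key point is that only d×d and d-dimensional quantities are ever materialized, never an N×N matrix.

```python
import numpy as np

def phi(x):
    # elu(x) + 1, which keeps the features strictly positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)            # (N, d) feature-mapped queries and keys
    KV = Kf.T @ V                      # (d, d), computed once in O(N d^2)
    Z = Qf @ Kf.sum(axis=0)            # (N,) per-query normalizer
    return (Qf @ KV) / Z[:, None]      # (N, d) output, no N x N matrix formed
```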
Now let's look at the recurrent view of linear attention:
Source: Image by Author
Why can we do this in linear attention and not in softmax attention? The softmax is not separable, so we can't write it as a product of separate terms. A nice thing to note here is that during decoding we only need to keep track of S_(n-1), giving us O(d²) cost per generated token since S is a d × d matrix.
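The recurrent view translates directly into code. The sketch below follows the formulation above, with the small addition of a running normalizer vector z (as in Katharopoulos et al.); the variable names are my own.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1

def decode_step(S, z, q_t, k_t, v_t):
    kf = phi(k_t)
    S = S + np.outer(kf, v_t)      # S_n = S_(n-1) + phi(k_n) v_n^T, an O(d^2) update
    z = z + kf                     # running sum of feature-mapped keys for normalization
    qf = phi(q_t)
    y_t = (qf @ S) / (qf @ z)      # output for the current token
    return S, z, y_t
```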
However, this efficiency comes with an important drawback. Since S_(n-1) can only store d² values (being a d × d matrix), we face a fundamental limitation. For instance, if your original context requires storing 20d² worth of information, you'll essentially lose 19d² worth of it in the compression. This is the core memory-efficiency tradeoff of linear attention: we gain computational efficiency by maintaining only a fixed-size state matrix, but that same fixed size limits how much context information we can preserve. This is the motivation for gating.
Gated Linear Attention
Okay, so we've established that we'll inevitably forget information when optimizing for efficiency with a fixed-size state matrix. This raises an important question: can we be smart about what we remember? This is where gating comes in — researchers use it as a mechanism to selectively retain important information, minimizing the impact of memory loss by being strategic about what to keep in the limited state. Gating isn't a new concept; it has been widely used in architectures like the LSTM.
The basic change here is in the way we formulate S_n:
Source: Image by author
There are many choices for G, each of which leads to a different model:
A key advantage of this architecture is that the gating function depends only on the current token x and learnable parameters, rather than on the entire sequence history. Since each token’s gating computation is independent, this allows for efficient parallel processing during training — all gating computations across the sequence can be performed simultaneously.
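As one illustrative example (the exact form of G differs between papers, so treat the sigmoid gate and the names W_g and x_t below as assumptions rather than any specific model), a per-dimension forget gate computed from the current token can decay the old state before the new key-value outer product is added:

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, as in linear attention

def gated_step(S, x_t, k_t, v_t, W_g):
    # The gate depends only on the current token x_t and the learnable matrix W_g,
    # not on sequence history, so all gates can be computed in parallel during training.
    g = 1.0 / (1.0 + np.exp(-(W_g @ x_t)))        # per-dimension gate in (0, 1): 1 keeps, 0 forgets
    S = g[:, None] * S + np.outer(phi(k_t), v_t)  # S_n = G_n ⊙ S_(n-1) + phi(k_n) v_n^T
    return S
```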
State Space Models
When we think about processing sequences like text or time series, our minds usually jump to attention mechanisms or RNNs. But what if we took a completely different approach? Instead of treating sequences as, well, sequences, what if we processed them more like how CNNs handle images using convolutions?
State Space Models (SSMs) formalize this approach through a discrete linear time-invariant system:
Source: Image by Author
Now let's see how this relates to convolution:
Source: Image by Author
where F is our learned filter derived from parameters (A, B, c), and * denotes convolution.
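A small NumPy sketch can make the equivalence explicit: the same (A, B, c) system can be run as a step-by-step recurrence or as a convolution with the filter F_k = c·Aᵏ·B, and the two produce identical outputs. The naive filter construction below is for illustration only; practical SSMs compute it far more efficiently.

```python
import numpy as np

def ssm_recurrent(A, B, c, u):
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t            # state update: x_t = A x_(t-1) + B u_t
        ys.append(c @ x)               # readout:      y_t = c x_t
    return np.array(ys)

def ssm_convolution(A, B, c, u):
    L = len(u)
    F = np.array([c @ np.linalg.matrix_power(A, k) @ B for k in range(L)])    # filter F_k = c A^k B
    return np.array([np.sum(F[:t + 1][::-1] * u[:t + 1]) for t in range(L)])  # y = F * u

rng = np.random.default_rng(0)
A, B, c = 0.5 * np.eye(3), rng.standard_normal(3), rng.standard_normal(3)
u = rng.standard_normal(10)
print(np.allclose(ssm_recurrent(A, B, c, u), ssm_convolution(A, B, c, u)))  # True
```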
H3 implements this state space formulation through a novel structured architecture consisting of two complementary SSM layers.
Here we project the input into three channels that imitate K, Q, and V. We then use two SSMs and two gating operations to roughly imitate linear attention, and it turns out that this kind of architecture works quite well in practice.
Selective State Space Models
Earlier, we saw how gated linear attention improved upon standard linear attention by making the information retention process data-dependent. A similar limitation exists in State Space Models — the parameters A, B, and c that govern state transitions and outputs are fixed and data-independent. This means every input is processed through the same static system, regardless of its importance or context.
We can extend SSMs by making them data-dependent through time-varying dynamical systems:
Source: Image by Author
The key question becomes how to parametrize c_t, B_t, and A_t as functions of the input. Different parameterizations can lead to architectures that approximate either linear or gated attention mechanisms.
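To make the idea concrete, here is one simple and purely illustrative parametrization in which the decay A_t, the input matrix B_t, and the readout c_t all depend on the current input. This is not Mamba's exact discretization, just a sketch of a time-varying SSM step with made-up parameter names.

```python
import numpy as np

def selective_ssm(u, W_B, W_c, w_a):
    # u: (L,) scalar input sequence; W_B, W_c, w_a: (n,) learnable vectors (illustrative names)
    x = np.zeros(W_B.shape[0])
    ys = []
    for u_t in u:
        a_t = 1.0 / (1.0 + np.exp(-(w_a * u_t)))   # input-dependent diagonal decay A_t in (0, 1)
        B_t = W_B * u_t                            # input-dependent B_t
        c_t = W_c * u_t                            # input-dependent c_t
        x = a_t * x + B_t * u_t                    # x_t = A_t x_(t-1) + B_t u_t
        ys.append(c_t @ x)                         # y_t = c_t x_t
    return np.array(ys)
```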
Mamba implements this time-varying state space formulation through selective SSM blocks.
Mamba uses selective SSM blocks rather than standard SSMs, together with output gating and an additional convolution, to improve performance. This is a very high-level picture of how Mamba combines these components into an efficient architecture for sequence modeling.
Conclusion
In this article, we explored the evolution of efficient sequence modeling architectures. Starting with traditional softmax attention, we identified its quadratic complexity limitation, which led to the development of linear attention. By rewriting attention using kernel functions, linear attention achieved O(Nd²) complexity but faced memory limitations due to its fixed-size state matrix.
This limitation motivated gated linear attention, which introduced selective information retention through gating mechanisms. We then explored an alternative perspective through State Space Models, showing how they process sequences using convolution-like operations. The progression from basic SSMs to time-varying systems and finally to selective SSMs parallels our journey from linear to gated attention — in both cases, making the models more adaptive to input data proved crucial for performance.
Through these developments, we see a common theme: the fundamental trade-off between computational efficiency and memory capacity. Softmax attention excels at in-context learning by maintaining full attention over the entire sequence, but at the cost of quadratic complexity. Linear variants (including SSMs) achieve efficient computation through fixed-size state representations, but this same optimization limits their ability to maintain detailed memory of past context. This trade-off continues to be a central challenge in sequence modeling, driving the search for architectures that can better balance these competing demands.
To read more on these topics, I would suggest the following papers:
This blog post was inspired by coursework from my graduate studies during Fall 2024 at the University of Michigan. While the courses provided the foundational knowledge and motivation to explore these topics, any errors or misinterpretations in this article are entirely my own. This represents my personal understanding and exploration of the material.
Reinforcement Learning (RL) is a branch of Artificial Intelligence that enables agents to learn how to interact with their environment. These agents, which range from robots to software features or autonomous systems, learn through trial and error. They receive rewards or penalties based on the actions they take, which guide their future decisions.
Among the most well-known RL algorithms, Proximal Policy Optimization (PPO) is often favored for its stability and efficiency. PPO addresses several challenges in RL, particularly in controlling how the policy (the agent’s decision-making strategy) evolves. Unlike other algorithms, PPO ensures that policy updates are not too large, preventing destabilization during training. This stabilization is crucial, as drastic updates can cause the agent to diverge from an optimal solution, making the learning process erratic. PPO thus maintains a balance between exploration (trying new actions) and exploitation (focusing on actions that yield the highest rewards).
Additionally, PPO is highly efficient in terms of both computational resources and learning speed. By optimizing the agent’s policy effectively while avoiding overly complex calculations, PPO has become a practical solution in various domains, such as robotics, gaming, and autonomous systems. Its simplicity makes it easy to implement, which has led to its widespread adoption in both research and industry.
This article explores the mathematical foundations of RL and the key concepts introduced by PPO, providing a deeper understanding of why PPO has become a go-to algorithm in modern reinforcement learning research.
1. The Basics of RL: Markov Decision Process (MDP)
Reinforcement learning problems are often modeled using a Markov Decision Process (MDP), a mathematical framework that helps formalize decision-making in environments where outcomes are uncertain.
A Markov chain models a system that transitions between states, where the probability of moving to a new state depends solely on the current state and not on previous states. This principle is known as the Markov property. In the context of MDPs, this simplification is key for modeling decisions, as it allows an agent to focus only on the current state when making decisions without needing to account for the entire history of the system.
An MDP is defined by the following elements:
- S: the set of possible states.
- A: the set of possible actions.
- P(s'|s, a): the transition probability of reaching state s' after taking action a in state s.
- R(s, a): the reward received after taking action a in state s.
- γ: the discount factor (a value between 0 and 1) that reflects the importance of future rewards.
The discount factor γ is crucial for modeling the importance of future rewards in decision-making problems. When an agent makes a decision, it must evaluate not only the immediate reward but also the potential future rewards. The discount γ reduces the impact of rewards that occur later in time due to the uncertainty of reaching those rewards. Thus, a value of γ close to 1 indicates that future rewards are almost as important as immediate rewards, while a value close to 0 gives more importance to immediate rewards.
The time discount reflects the agent’s preference for quick gains over future ones, often due to uncertainty or the possibility of changes in the environment. For example, an agent will likely prefer an immediate reward rather than one in the distant future unless that future reward is sufficiently significant. This discount factor thus models optimization behaviors where the agent considers both short-term and long-term benefits.
The goal is to find an action policy π(a|s) that maximizes the expected sum of rewards over time, often referred to as the value function:
This function represents the expected total reward an agent can accumulate starting from state s and following policy π.
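As a small illustration of what the value function averages over, here is a sketch of the discounted return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … for a finished episode; the function name and example rewards are made up for demonstration.

```python
def discounted_returns(rewards, gamma=0.99):
    # Walk backwards through the episode so each return reuses the one after it.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```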
2. Policy Optimization: Policy Gradient
Policy gradient methods focus on directly optimizing the parameters θ of a policy πθ by maximizing an objective function that represents the expected reward obtained by following that policy in a given environment.
The objective function is defined as:
Where R(s, a) is the reward received for taking action a in state s, and the goal is to maximize this expected reward over time. The term dπ(s) represents the stationary distribution of states under policy π, indicating how frequently the agent visits each state when following policy π.
The policy gradient theorem gives the gradient of the objective function, providing a way to update the policy parameters:
This equation shows how to adjust the policy parameters based on past experiences, which helps the agent learn more efficient behaviors over time.
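In practice, this update is usually implemented as gradient ascent on log-probabilities weighted by returns (the REINFORCE estimator). Below is a minimal PyTorch sketch; the tensor names and the use of PyTorch are my own choices, not something prescribed by the original papers.

```python
import torch

def policy_gradient_loss(logits, actions, returns):
    # logits: (T, n_actions) policy outputs, actions: (T,) long tensor of chosen actions,
    # returns: (T,) discounted returns collected under the current policy
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t | s_t)
    # Maximizing E[log pi * R] is the same as minimizing its negative.
    return -(chosen * returns).mean()
```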
3. Mathematical Enhancements of PPO
PPO (Proximal Policy Optimization) introduces several important features to improve the stability and efficiency of reinforcement learning, particularly in large and complex environments. PPO was introduced by John Schulman et al. in 2017 as an improvement over earlier policy optimization algorithms like Trust Region Policy Optimization (TRPO). The main motivation behind PPO was to strike a balance between sample efficiency, ease of implementation, and stability while avoiding the complexities of TRPO’s second-order optimization methods. While TRPO ensures stable policy updates by enforcing a strict constraint on the policy change, it relies on computationally expensive second-order derivatives and conjugate gradient methods, making it challenging to implement and scale. Moreover, the strict constraints in TRPO can sometimes overly limit policy updates, leading to slower convergence. PPO addresses these issues by using a simple clipped objective function that allows the policy to update in a stable and controlled manner, avoiding forgetting previous policies with each update, thus improving training efficiency and reducing the risk of policy collapse. This makes PPO a popular choice for a wide range of reinforcement learning tasks.
a. Probability Ratio
One of the key components of PPO is the probability ratio, which compares the probability of taking an action in the current policy πθ to that of the old policy πθold:
This ratio provides a measure of how much the policy has changed between updates. By monitoring this ratio, PPO ensures that updates are not too drastic, which helps prevent instability in the learning process.
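In code, the ratio is usually computed from stored log-probabilities, which is numerically safer than dividing raw probabilities. A one-line sketch with illustrative names:

```python
import torch

def probability_ratio(log_prob_new, log_prob_old):
    # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
    return torch.exp(log_prob_new - log_prob_old)
```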
b. Clipping Function
Clipping is preferred over adjusting the learning rate in Proximal Policy Optimization (PPO) because it directly limits the magnitude of policy updates, preventing excessive changes that could destabilize the learning process. While the learning rate uniformly scales the size of updates, clipping ensures that updates stay close to the previous policy, thereby enhancing stability and reducing erratic behavior.
The main advantage of clipping is that it allows for better control over updates, ensuring more stable progress. However, a potential drawback is that it may slow down learning by limiting the exploration of significantly different strategies. Nonetheless, clipping is favored in PPO and other algorithms when stability is essential.
To avoid excessive changes to the policy, PPO uses a clipping function that modifies the objective function to restrict the size of policy updates. This is crucial because large updates in reinforcement learning can lead to erratic behavior. The modified objective with clipping is:
The clipping function constrains the probability ratio within a specific range, preventing updates that would deviate too far from the previous policy. This helps avoid sudden, large changes that could destabilize the learning process.
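Here is a sketch of the clipped surrogate objective, using an epsilon of 0.2 as a typical default rather than a prescription:

```python
import torch

def clipped_surrogate(log_prob_new, log_prob_old, advantages, eps=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum makes the objective pessimistic: the policy gets no extra
    # credit for pushing the ratio outside the [1 - eps, 1 + eps] band.
    return torch.min(unclipped, clipped).mean()
```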
c. Advantage Estimation with GAE
In RL, estimating the advantage is important because it helps the agent determine which actions are better than others in each state. However, there is a trade-off: using only immediate rewards (or very short horizons) can introduce high variance in advantage estimates, while using longer horizons can introduce bias.
Generalized Advantage Estimation (GAE) strikes a balance between these two by using a weighted average of n-step returns and value estimates, making it less sensitive to noise and improving learning stability.
Why use GAE?
- Stability: GAE helps reduce variance by considering multiple steps, so the agent does not react to noise in the rewards or temporary fluctuations in the environment.
- Efficiency: GAE strikes a good balance between bias and variance, making learning more efficient by not requiring overly long sequences of rewards while still maintaining reliable estimates.
- Better action comparison: by considering not just the immediate reward but a broader horizon of rewards, the agent can better compare actions over time and make more informed decisions.
The advantage function At is used to assess how good an action was relative to the expected behavior under the current policy. To reduce variance and ensure more reliable estimates, PPO uses Generalized Advantage Estimation (GAE). This method smooths out the advantages over time while controlling for bias:
This technique provides a more stable and accurate measure of the advantage, which improves the agent’s ability to make better decisions.
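A common way to compute GAE is a single backward pass over the trajectory, accumulating the temporal-difference errors δ_t = r_t + γV(s_{t+1}) − V(s_t) with weight (γλ). The sketch below ignores episode boundaries for brevity.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # values holds one extra entry: the value estimate for the state after the last step
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error at step t
        gae = delta + gamma * lam * gae                         # discounted sum of future deltas
        advantages[t] = gae
    return advantages
```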
d. Entropy to Encourage Exploration
PPO incorporates an entropy term in the objective function to encourage the agent to explore more of the environment rather than prematurely converging to a suboptimal solution. The entropy term increases the uncertainty in the agent’s decision-making, which prevents overfitting to a specific strategy:
Where H(πθ) represents the entropy of the policy. By adding this term, PPO ensures that the agent does not converge too quickly and is encouraged to continue exploring different actions and strategies, improving overall learning efficiency.
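Putting the pieces together, the overall PPO loss typically combines the clipped policy term, a value-function loss, and the entropy bonus. The coefficients below are common defaults rather than tuned values, and the function signature is my own sketch.

```python
import torch

def ppo_loss(log_prob_new, log_prob_old, advantages, values, returns, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(log_prob_new - log_prob_old)
    policy_loss = -torch.min(ratio * advantages,
                             torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()   # critic regression toward observed returns
    # Subtracting the entropy term rewards higher-entropy (more exploratory) policies.
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```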
Conclusion
The mathematical underpinnings of PPO demonstrate how this algorithm achieves stable and efficient learning. With concepts like the probability ratio, clipping, advantage estimation, and entropy, PPO offers a powerful balance between exploration and exploitation. These features make it a robust choice for both researchers and practitioners working in complex environments. The simplicity of PPO, combined with its efficiency and effectiveness, makes it a popular and valuable algorithm in reinforcement learning.
A year-end summary for junior-level MLE interview preparation
Job-seeking is hard!
In today's market, job-seeking for machine learning-related roles is more complex than ever. Even though public reports claim that demand for machine learning engineers (MLEs) is growing fast, the market has turned into an employer's market over the past few years. Finding an ML job in 2020, 2022, and 2024 could be completely different experiences. What's more, several factors contribute to the disparities in job-seeking difficulty across geography, domain, and seniority level:
Geography: According to the People in AI report, the top cities hiring in North America in 2024 are the Bay Area, NYC, Seattle, etc. If we use the ratio of job postings to professionals as a rough proxy for the success rate of finding a job, the success rate in the Bay Area is 3.6%. However, if you live in LA or Toronto, demand is much lower, and the success rate drops to 1.4%, only 40% of the Bay Area's. The success rate could be even lower in other cities.
Domain: The skill sets needed for ML Engineer roles vary widely by domain. Take deep learning models as an example: CV usually uses models like ResNet and YOLO, while NLP involves understanding RNNs, LSTMs, GRUs, and Transformers; Fraud Detection uses SpinalNet; LLM work focuses on knowledge of Llama and GPT; and Recommendation Systems require understanding Word2Vec and Item2Vec. However, not all domains are hiring the same number of ML Engineers. If we search the tags in the system design case studies from Evidently AI, the tag CV corresponds to 30 use cases, Fraud Detection to 29, NLP to 48, LLM to 81, and Recommendation System to 82. Recommendation Systems have almost 2.7 times as many use cases as CV. This ratio might be highly biased and may not truly reflect the actual job market, but it suggests there are likely more opportunities for ML Engineers in Recommendation Systems.
Seniority level. According to the 365 DataScience report, although 72% of the postings don’t explicitly state the YOE required, engineers with 2–4 YOE are in the highest demand. This means you’ll likely face more difficulty getting an entry-level job offer. (The reason that most demand is on engineers with 2–4 YOE could be explained by the fact that MLE was not a typical role five years ago, as explained in this blog post.)
This article will summarize the materials and strategies for MLE interview preparation. But please remember, this is just an empirical list of information I gathered, which might or might not work for your background or upcoming interviews. Hopefully, this article will shed some light or guidance on your career-advancing journey.
The interview journey can be long, painful, and lonely. When you start applying for jobs, there are several things you should think about and plan for:
Interview timeline
Types of roles
Types of companies
Domain
Location
Interview timeline. The timeline for each company is different. For companies of smaller sizes (<500), usually in pre-seed or series A/B, the timeline is generally faster, and you can expect to finish the application process within a few weeks. However, for companies of larger sizes (> 10k or FAANG), from application submission to the final offer stage, it can vary from 3–6 months, if not longer.
Types of roles. I would refer to Chip Huyen's Machine Learning Interview book for a more detailed discussion of different ML-related roles. The MLE role can come under different names, such as machine learning engineer, machine learning scientist, deep learning engineer, machine learning developer, applied machine learning scientist, data scientist, etc. At the end of the day, a typical machine learning engineer role is end-to-end: you'll start by talking to product managers (sometimes to customers) and defining the ML problem, then prepare the dataset, design and train the model, define the evaluation metrics, serve and scale the model, and keep improving the outcome. Sometimes, companies mix the titles, e.g., MLE with ML Ops. It's the responsibilities that matter, not the titles.
Types of companies. Again, Chip Huyen’s Machine Learning Interview book discusses the differences between application and tooling companies, large companies and startups, and B2B and B2C companies. Moreover, it’s worth considering whether the company is public or private, whether it’s sales-driven or product-oriented. These concepts should not be overlooked, especially if you’re looking for your first industry job, as they will build your “lens of career,” which we’ll discuss later in the interview strategy section.
Domain. As mentioned in the introduction, MLE jobs fall into different domains, like recommendation systems and LLMs, and you need to spend time preparing the fundamental knowledge for each. Identify one or two domains you're most interested in to maximize your chances; preparing for all domains at once is almost impossible, as it will disperse your energy and attention and leave you under-prepared.
Location. Beyond all the points above, location is a serious matter. Looking for MLE jobs will be even more difficult unless you live in a high-demand area like the Bay Area or NYC. If relocating is impossible, you probably need to plan for a longer timeline to get a satisfying job offer; however, if you leave the relocation option open, applying to opportunities in high-demand areas is probably a good idea.
During the Interview — How to Prepare?
Once you start the application process and start to get interviews, there are a few things you need to search and prepare for:
Interview format
Referrals, networking
LinkedIn or Portfolio
Interview resources and materials
Strategies: planning, tracking, evolving, prompting, estimating your level, wearing your “lens of career,” getting an interview partner, red flags
Accepting the offer
Interview format
The interview format varies among companies. No two companies have the same interview format for the MLE role, so you must do your "homework" and research the format in advance. For example, even among FAANG companies, Apple is known for its startup-style interview format, which varies from team to team. On the other hand, Meta tends to have a consistent interview format at the company level, comprising one or two LeetCode rounds and ML system design rounds. Usually, the recruiter will give detailed information about a large company's interview format, so you won't be surprised. However, the process can be less structured for smaller companies and change more frequently. Sometimes, smaller companies replace LeetCode with other coding questions and only lightly touch the modelling part instead of having an entire ML system design session. You should search for information on free websites like Prepfully, Glassdoor, and Interview Query, or on paid websites, for a comprehensive understanding of the interview format and process so you can prepare better in advance. Lastly, don't be limited by the interview format, as it's not a standardized test — there could be behavioural elements during technical interviews and technical questions during your hiring manager round. Be prepared, but be flexible and ready to be surprised.
Referrals and networking
Many web articles exaggerate the benefit of referrals; having a referral is just a shorter path that gets you past the recruiter round and lands you directly in the second round (usually the hiring manager round). Besides having referrals, it's almost equally essential to network in person, e.g., use hackathon opportunities to talk to companies, go to in-person job fairs, and participate in offline volunteer events sponsored by companies you're interested in. Don't rely on referrals or networking to get a job, but use them as opportunities to get more conversations with recruiters and hiring managers and maximize your interview efficiency.
LinkedIn or Portfolio. LinkedIn and your portfolio are just advertising tools that help recruiters understand who you are beyond the textual information in your resume. As a junior MLE, it helps to include course projects and Kaggle challenges in your GitHub repository to show more relevant experience; however, as you get more senior, toy projects make less sense, and PRs in large-scale open-source projects, insightful articles and analyses, and tutorials on SOTA research or toolboxes will make you stand out from the rest of the candidates.
Interview resources and materials
Generally speaking, you need materials covering the five domains: i) coding, ii) behaviour, iii) ML/Deep learning fundamentals, iv) ML system design, and v) a general MLE interview advice book.
i) Coding. If you're not a LeetCode expert, then I would recommend starting with the following resources:
You also need to have a handful of interview partners — these days, you can subscribe to online interview preparation services (don’t use the costly ones which charge you thousands of dollars; there are always cheaper replacements) and pair up with other MLE candidates for skill and information exchange.
Strategies: planning, tracking, evolving, prompting, interview for one level up, wearing your “lens of career,” getting an interview partner, red flags
Planning, tracking, and evolving. Ideally, as described in this article, you should get at least a handful of recruiter calls and categorize your interviews by interest level. For one thing, the job market is constantly changing, and someone can rarely plan the best strategy before the first interview. For another, you'll learn and grow during the interview process, so you'll be in a different place than you were a few months earlier at the beginning of the job search. So, even if you're the most talented candidate on the market, it's essential to spread your conversations out over a few months, start with the ones you're least interested in to familiarize yourself with the market and sharpen your interview skills, and leave the most important ones for later. Track your progress, feedback, and thoughts during your interview process. Set specific learning goals and evolve with your interviews. You might never have had the chance to touch GenAI in the past few years, but you can use the interview process to learn from online courses and build small side projects. The best outcome is to get a job after the interview, and the second best is to learn something useful even if you don't get the offer. If you keep learning from every interview, it will vastly increase your chances of getting the next job offer.
Prompting. This is the age of LLM, and you should utilize it wisely. Look for the keywords in the job descriptions or responsibilities. If there is an interview involving “software engineering principles,” then you can prompt your favourite LLM to give you a list of software engineering principles for machine learning for preparation purposes. Again, the prompt answer shouldn’t be your sole source of knowledge, but it can compensate for some blind spots from your daily reading sources.
Interview for one level up. Sometimes, the boundaries between levels are blurry. Unless you're an absolute beginner in this field, you can always try opportunities that are one level above and accept a lower level at the job offer stage if needed. If you're interviewing at the senior level, preparing for or applying to staff-level opportunities doesn't hurt. It doesn't always work, but sometimes it can open doors for you.
Wear your "lens of career". Don't just go to an interview without thinking about your career. Unless you desperately need this job for a specific reason, ask yourself: where does this job fit into your overall career map? This question matters from two perspectives. First, it helps you choose the company you want to join, e.g., one startup might offer a higher salary in the short run, but if it doesn't prioritize sound software engineering principles, you'll lose the opportunity to grow into a promising career in the long run. Second, it helps you diagnose the outcome of the interviews, e.g., if your rejections are mostly from startups but you eventually land offers from well-known listed companies, you'll realize the rejections don't mean you're not a qualified MLE; rather, interviewing at startups requires different skills, and those don't belong to your career path.
Partner up. Five years ago, there was no such thing as finding an interview partner. But these days, there are interview services all over the internet, ranging from extremely expensive (which I don't recommend) to a few hundred dollars. Remember, it's a constantly shifting market, so nobody knows the whole picture. The best way to gain information is to partner with non-competitive peers (e.g., you're in the CV domain, and your partner is in the recommendation system domain) to practice and improve together. Better than just partnering, you should seek to partner up — look for people at a higher seniority level while you can still offer something useful to them. You might ask: how is that possible? Why would someone more senior want to practice with me? Remember, nobody is perfect, and you can always offer others something. There are senior software engineers who would like to become MLEs, and you can trade your ML knowledge for their software engineering best practices. There are product managers who need ML-related input, and you can ask for behavioural practice in return. Even for people with no industry experience at the entry level, you can still ask for coding practice in return, or listen to their life stories and get inspired. As an MLE, especially at the senior/staff level, you need to demonstrate leadership skills, and the best leadership skill you can demonstrate is bringing together professionals at different levels to help achieve the goal you're chasing — your dream offer.
Red flags. Some red flags, like directly asking you to overwork or ghosting you mid-interview, are explicit. However, some red flags are more subtle or deliberately disguised. For example, your hiring manager might politely explain their situation and wish you "didn't have high expectations at the beginning and decide to leave in a few months" — it sounds considerate, but it hides the fact that turnover is high. The best strategy for avoiding red flags involves reading Glassdoor reviews and learning about company culture during the interview. Specifically, "culture" doesn't mean the "culture claims" on the company website but the actual dynamics between you and the team. Are the interviewers only asking prepared questions without trying to understand your problem-solving skills? When you ask a question, can the interviewer pick it up and give an answer that helps you understand the company's values better? Lastly, always remember to use your gut feeling to decide whether you like your future team. After all, if you take the offer, you'll be facing these people eight hours a day for the next few years; if your gut tells you that you don't like them, you won't be happy anyway.
Accepting the offer. Once you’re done with all the frustration, all the disappointments, and all the hard work, it’s time to talk about the offer. Many web threads discuss the necessity of negotiating the offer, but I suggest being cautious, especially in this employer’s market. If you want to negotiate, the best practice is to have two comparable offers and prepare for the worst case. Also, websites like Levels.fyi and Glassdoor should be used to research the compensation range.
After the Interview — What to Do Next?
Congratulations! Now you've accepted your offer and are ready to start your new journey, but is there anything else you can do?
Summarize your interview process. The interview journey was long and painful, but you also gained a lot from it! Now that you're relaxed and happy, it's the best opportunity to reflect, be grateful to the people who helped you during the interviews, and share some of what you learned with those still struggling with their own process. Besides, you must have collected a lot of notes and to-do lists during the past few months without ever having time to sit down and organize them, and now is the best opportunity to do that!
Plan your career path. Your self-understanding has probably changed over the past few months; you now have a better sense of your learning ability and your ability to solve problems under pressure. You have talked to many startups and big tech companies and are starting to form a better picture of where you will be in the next five years. If you're at the mid-to-senior level, you have probably talked to many people at the staff level and gotten a better idea of what you will work on in the next stage of your career. This is the time to do all this planning!
Keep learning. If you're from an academic background, you're probably used to reading papers and learning about SOTA ML techniques. However, the MLE role goes beyond academia: it combines research, applied ML practice, and software engineering to make a real business impact. Now is a good time to think about the best strategy for learning from multiple sources to keep yourself up to date.
Get ready for the new role. You have talked to the other MLEs in your new company and know what models or tech stack they're using. In rare cases, you already know this tech stack very well, but most of the time you need to learn many new things for the new role. Make a plan for how you will learn them and set small milestones to achieve the goals. Also, learn about your new company: explore its homepage and understand its business goals. This will help set a good tone when you start the new job and talk to your new colleagues.
After all, the interview journey is different for everyone. Your level of experience, focus on the domain, long-term career goal, and personality all form a unique interview journey. Hopefully, this article will shed some light on the interview preparation materials and strategies. And I hope everyone will eventually land on their dream offer!
Acknowledgement: Special thanks to Ben Cardoen and Rostam Shirani for proofreading and insightful suggestions that contributed to the final version of this article.
Peng Shao, "Inside the Machine Learning Interview: 151 Real Questions from FAANG and How to Answer Them," 2023
Chip Huyen, “Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications,” 2022
Ali Aminian, Alex Xu, “Machine Learning System Design Interview,” 2023
Christopher M. Bishop, Hugh Bishop, “Deep Learning: Foundations and Concepts,” 2023
Tanya Reilly, “The Staff Engineer’s Path: A Guide for Individual Contributors Navigating Growth and Change,” 2022
Melia Stevanovic, “Behavioral Interviews for Software Engineers: All the Must-Know Questions With Proven Strategies and Answers That Will Get You the Job,” 2023