A story about the creation of a Simulation Optimization model to help streamline support desk staffing.
Introduction
It started as a fairly simple request. My friend, who helps run and operate a support center, was having some difficulties. At any given time, the support desk seemed to be staffed inefficiently, either overstaffed or understaffed. He had decently clean data about when calls (phone calls, chat messages, and emails) came in, how long callers waited on hold, and how long conversations lasted. Knowing my background in Operations Research (OR), my friend brought this problem to me. I was excited: in my career, I don't regularly use OR principles, and this was an opportunity to get my feet wet with something I dedicated five years of my life to studying.
Core to any OR program is queueing theory, and this was a textbook case (literally) of an M/M/c queue: calls arrive at a stochastic rate, a fixed number of agents (servers) handle them, each call takes a stochastic amount of time to handle, and both the interarrival and handle times follow exponential distributions. So I fired up a Jupyter Notebook and used SciPy to fit exponential distributions to some of the incoming call and handle times. I found that, indeed, the data matched exponential distributions pretty closely.
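To give a flavor of that check, here is roughly what the fitting step looked like. The numbers below are made up for illustration; they are not my friend's data.

import numpy as np
from scipy import stats

# Hypothetical data: minutes between consecutive incoming calls,
# as it might appear after pulling it from the support desk logs.
interarrival_minutes = np.array([1.2, 0.4, 3.1, 0.9, 2.2, 0.7, 1.5])

# Fit an exponential distribution (location pinned at 0) and compare
# the fitted mean to the sample mean as a sanity check.
loc, scale = stats.expon.fit(interarrival_minutes, floc=0)
print(f"fitted mean inter-arrival time: {scale:.2f} min")
print(f"sample mean: {interarrival_minutes.mean():.2f} min")

# A Kolmogorov-Smirnov test gives a rough sense of goodness of fit.
ks_stat, p_value = stats.kstest(interarrival_minutes, "expon", args=(loc, scale))
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3f}")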
Now, if my friend simply wanted to know how many agents he should have staffing his support desk, there are formulas that tell you exactly that, chiefly Erlang C. [1] However, like any real-world scenario, that formula broke down quickly once we introduced the multitude of parameters that affected this specific support center. To name a few: incoming call demand that varies throughout the day, agent efficiency, breaks, agents working in shifts; the list goes on. So that brought me to simulation, another core tenet of OR.
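Before moving on, for reference: the Erlang-C waiting probability and the resulting service level are compact enough to compute directly. A quick sketch, with illustrative arrival and handle rates rather than the support center's real numbers:

from math import factorial, exp

def erlang_c_wait_probability(arrival_rate, service_rate, agents):
    """Probability an incoming call has to wait at all (Erlang C)."""
    a = arrival_rate / service_rate   # offered load in Erlangs
    rho = a / agents                  # agent occupancy; must be below 1
    numerator = (a ** agents) / (factorial(agents) * (1 - rho))
    denominator = sum(a ** k / factorial(k) for k in range(agents)) + numerator
    return numerator / denominator

def service_level(arrival_rate, service_rate, agents, target_wait):
    """Fraction of calls answered within target_wait time units."""
    p_wait = erlang_c_wait_probability(arrival_rate, service_rate, agents)
    return 1 - p_wait * exp(-(agents * service_rate - arrival_rate) * target_wait)

# Illustrative only: 100 calls/hour, 6 calls/hour per agent, 20 agents,
# target of answering within 3 minutes (0.05 hours).
print(service_level(arrival_rate=100, service_rate=6, agents=20, target_wait=0.05))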
Simulation
Simulation is useful because you are not constrained to specific formulas. You can use the properties of stochastic systems to your advantage: any system with stochastic inputs will converge to a stable output, provided you run it enough times to get a confident average. Most weirdness in the system and data can be accounted for by adjusting parts of the simulation and matching the output to the real-world training data.
So I got to work. Over the next couple of days, in Python, I built out a simulation of the support desk using the real data as inputs. The core of the simulation accounted for incoming call rates and agent handle times that vary throughout the day, while still tracking components like the length of the queue and which agents were working as parameters shifted. Then it was time to decide what we were going to measure, and there were plenty of options. I started with agent utilization and the average queue length; however, one of the main metrics for support centers is the connection-time Service Level Agreement (SLA). [2] The SLA is, in essence, an agreed-upon target for how long people should wait in the queue. It is important to monitor not just as an average, but as a time series throughout the day: support centers want calls answered in a short, predictable amount of time, no matter when someone calls.
Luckily, this was pretty easy. Any time I took a measurement during a simulation run, I would check how long the first person in line had been waiting. Since the first person in line has been waiting the longest, if their wait is within the target, everyone behind them is too, so we know the SLA is still being met. We built our training and validation sets from his data, made some more tweaks, and my friend had a tool he could use to change the number of agents and the incoming call parameters and see what would happen to the SLA. Not to be discounted: this part of the project took a lot of the time.
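To make the idea concrete, here is a stripped-down sketch of that kind of simulation using SimPy, including the head-of-queue check described above. The rates, agent count, and sampling interval are placeholders; the real model also varied the rates by time of day and layered in shifts and breaks.

import random
import simpy

ARRIVAL_RATE = 1.0   # calls per minute (illustrative, not the real data)
HANDLE_RATE = 0.25   # calls handled per minute per agent (mean handle ~4 min)
AGENTS = 5
DAY_MINUTES = 540

def caller(env, agents, in_queue, waits):
    arrived = env.now
    in_queue.append(arrived)
    with agents.request() as req:
        yield req                            # wait for a free agent
        in_queue.remove(arrived)
        waits.append(env.now - arrived)
        yield env.timeout(random.expovariate(HANDLE_RATE))  # handle the call

def call_source(env, agents, in_queue, waits):
    while True:
        yield env.timeout(random.expovariate(ARRIVAL_RATE))
        env.process(caller(env, agents, in_queue, waits))

def monitor(env, in_queue, sla_samples, every=5):
    # The head of the queue has waited the longest, so sampling that one
    # wait gives the worst current time-in-queue.
    while True:
        yield env.timeout(every)
        sla_samples.append(env.now - in_queue[0] if in_queue else 0.0)

env = simpy.Environment()
agents = simpy.Resource(env, capacity=AGENTS)
in_queue, waits, sla_samples = [], [], []
env.process(call_source(env, agents, in_queue, waits))
env.process(monitor(env, in_queue, sla_samples))
env.run(until=DAY_MINUTES)
print(f"average wait: {sum(waits) / len(waits):.2f} min")
print(f"worst sampled time-in-queue: {max(sla_samples):.2f} min")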
Optimization
During a road trip we discussed how this tool was interesting but not necessarily useful. My friend could plug in numbers and see what happened, but that didn't make staffing decisions much easier; it was basically educated trial and error. What would be more helpful was a system where you could set a target SLA throughout the day and the system would give you the best agent schedule to hit that target, given variable incoming call parameters and agent handle times. That is the third great pillar of OR: optimization.
The issue was that with optimization you needed a formula to optimize, and we had a simulation. However, it dawned on me that there was no reason we couldn’t use the simulation as the objective function for an optimization problem. It takes a set of inputs, and with enough runs of the simulation, gives you a confident output. Luckily, 30 years ago researchers had the same thought and have been studying the field of Simulation Optimization ever since. These smart people had done all the hard work for me. [3, 4]
So, again, I got to work. The core of Simulation Optimization is creating a set of inputs, running them through your simulation, and choosing the next set of inputs in an intelligent way until you reach your objective. My optimization problem looked something like this:
My objective is to build a schedule. Schedules are made of shifts. Shifts are set times throughout the simulation "day" (say 9 a.m. to 5 p.m., or 10 a.m. to 4 p.m.) during which an agent can work. A schedule is the set of all shifts, along with how many agents work each shift. A good schedule keeps the SLA hovering right around the target while minimizing total worker hours (number of workers multiplied by the length of their shifts). Certain shifts have a cap on how many people can work them (e.g., only 10 people can work the half-day morning shift), and each shift has a certain break schedule. The SLA should never cross a hard ceiling above the target (e.g., our target is an SLA of 3 minutes, but no one should ever wait more than 10 minutes).
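As a rough sketch of how a schedule and its objective might be encoded (the shift caps, penalty value, and function names here are my illustration, not the production code):

# A schedule maps a shift, (start, end) in minutes from opening, to a head count.
example_schedule = {(0, 540): 5, (0, 270): 3, (270, 540): 4}   # illustrative

# Hypothetical caps on how many agents each shift can hold.
SHIFT_CAPS = {(0, 540): 25, (0, 270): 10, (270, 540): 15}

def worker_hours(schedule):
    """Total staffing cost: agents on a shift times the shift length in hours."""
    return sum(n * (end - start) / 60 for (start, end), n in schedule.items())

def violates_caps(schedule, caps):
    """True if any shift is staffed beyond its cap."""
    return any(n > caps.get(shift, float("inf")) for shift, n in schedule.items())

def objective(schedule, simulate, target_sla, hard_ceiling):
    """Score a schedule: deviation of each shift's average time-in-queue from
    the target SLA, with a heavy penalty if the worst wait breaks the hard
    ceiling. `simulate` stands in for the simulation described earlier and is
    assumed to return (per-shift average waits, worst observed wait)."""
    per_shift_wait, worst_wait = simulate(schedule)
    deviation = sum(abs(w - target_sla) for w in per_shift_wait.values())
    if worst_wait > hard_ceiling:
        deviation += 1_000   # illustrative penalty, not a tuned value
    return deviation, worker_hours(schedule)

print(worker_hours(example_schedule))              # 76.5 agent-hours
print(violates_caps(example_schedule, SHIFT_CAPS)) # False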
I set up the framework so all of these constraints were accounted for, and now I was ready to optimize. Some problems in Simulation Optimization research are difficult because their simulations are black boxes, meaning you don't get any useful signal besides the final answer. Luckily, this simulation is different. We measure the time-in-queue throughout the simulation "day," so we can see how shifts perform relative to one another.
For example, if the morning shift has more agents than the late shift but the late shift has more incoming calls, we will likely see that the late shift is more off target than the morning shift. In our next attempt we should probably increase the number of people on the late shift.
It’s not as simple as this — with complicated overlapping shifts and other parameters there will be interdependency, but we can plug any schedule into the simulation and see what happens.
These signals from the simulation led me to a clear strategy: gradients.
Gradient Based Search can be understood as a ball rolling down a hill, where with every step you are trying to get the ball to a lower state until it reaches the bottom. This is complicated by the fact that it is not an even hill. There will be false bottoms where any direction you pick goes back up the hill, but the true bottom requires you to start rolling from a different location. In practice, we’ll never know if we have reached the true bottom, but there are smart ways to make sure we do enough tests to know that we are pretty close. In our case the ball is a schedule, and the bottom of the hill is the perfect line where all incoming calls wait for exactly the target SLA.
So, what is the most natural gradient step? Take the worst shift (the one whose average is farthest from the target) and add an agent to it if the average wait is above your target, or take an agent away if the average is below the target. If you get stuck where every shift change makes the schedule worse but the SLA is not close to the target, go back to a previously good schedule and start again from there. I tried other methods of picking the next schedule, but this was the most consistent. How do you know when to stop? Set a tolerance: if every shift is, on average, less than 30 seconds (for example) from the target, consider the problem solved. Also stop if too many steps have been taken, and pick the best schedule from the ones simulated.
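A bare-bones sketch of that step logic might look like this. Here `simulate` is a stand-in for the (stochastic) simulation above, assumed to return the average time-in-queue per shift, and the names and defaults are mine rather than the real implementation.

def gradient_step(schedule, per_shift_wait, target_sla, caps):
    """Nudge the shift whose average time-in-queue is farthest from the target:
    add an agent if callers wait too long, remove one if they barely wait."""
    worst = max(per_shift_wait, key=lambda s: abs(per_shift_wait[s] - target_sla))
    new_schedule = dict(schedule)
    if per_shift_wait[worst] > target_sla:
        if new_schedule[worst] < caps.get(worst, float("inf")):
            new_schedule[worst] += 1
    else:
        new_schedule[worst] = max(0, new_schedule[worst] - 1)
    return new_schedule

def optimize(initial, simulate, target_sla, caps, tolerance=0.5, max_steps=200):
    """Greedy search with a fallback: keep the best schedule seen, and stop once
    every shift is within `tolerance` of the target or the step budget runs out."""
    best, best_dev, best_waits = None, float("inf"), None
    current = initial
    for _ in range(max_steps):
        per_shift_wait = simulate(current)   # averaged over several replications
        deviation = max(abs(w - target_sla) for w in per_shift_wait.values())
        if deviation < best_dev:
            best, best_dev, best_waits = current, deviation, per_shift_wait
        if best_dev < tolerance:
            break
        # If the last move made things worse, fall back to the best schedule
        # seen so far instead of digging deeper in a bad direction.
        if deviation > best_dev:
            current, per_shift_wait = best, best_waits
        current = gradient_step(current, per_shift_wait, target_sla, caps)
    return best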
In an ideal world we’d run this program forever, trying every single schedule until we find the best, but we needed this program to pick a schedule in under 5 min. The most interesting part of this project was testing optimization algorithms to find the one that can get a good solution the quickest.
The solution with the best results exploited the fact that you can control the number of times you run the simulation; the trade-off is that the fewer runs you do, the less confident you are in the results. Here's what I found to work:
Start with fewer simulation runs at the beginning of the optimization, and increase the number as you approach the objective. This works because, early on, many shift changes will move the schedule toward the objective, so it's not as important to be confident in the direction you chose. As a good solution gets closer, it matters more where the schedule goes. Combine this with several layered runs, all starting from the same null schedule, to ensure a wider breadth of explored solutions.
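Sketched out, the idea looks something like this. The thresholds, run counts, and restart count are illustrative rather than the tuned values, and `optimize_once` stands in for one full pass of the search described above.

def runs_for(deviation, target_sla, min_runs=5, max_runs=50):
    """How many simulation replications to average for the next candidate:
    few while far from the target (cheap, rough estimates are fine), many
    once the schedule is close and the direction really matters."""
    if deviation > 2 * target_sla:
        return min_runs
    if deviation > 0.5 * target_sla:
        return (min_runs + max_runs) // 2
    return max_runs

def multi_start(optimize_once, null_schedule, restarts=3):
    """Run the whole search several times from the same empty ("null")
    schedule and keep the best result, to hedge against false bottoms."""
    results = [optimize_once(dict(null_schedule)) for _ in range(restarts)]
    return min(results, key=lambda pair: pair[1])   # (schedule, deviation) pairs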
What's great about this approach is that I can control how long the algorithm runs. I can see how long simulation runs take and adjust the number of iterations accordingly, and I can also pick how many times to begin from the starting point. Every machine you run this algorithm on will operate a little differently, and a consistent run time is of paramount importance for making this project useful in the real world. Here is an example of the optimizer's output for one run:
# schedule is {(start_time, end_time): num_agents}
{'Schedule': {(0, 540): 5,
              (30, 570): 2,
              (60, 600): 1,
              (90, 630): 1,
              (120, 660): 4,
              (150, 690): 1,
              (180, 720): 0,
              (210, 750): 1,
              (240, 780): 18,
              (0, 660): 0,
              (30, 690): 0,
              (60, 720): 0,
              (90, 750): 0,
              (120, 780): 3},
 'Average Wait': 0.5738814985851588,
 'Worker Units': 660.0,
 'Worst Shift Time In Queue, Relative to Target': 0.5965600329117189}
Conclusion
Now my friend has a program that should be able to build a schedule for him given his inputs and constraints. It remains to be seen how well it actually performs; we will have to gather enough data on the schedules and on the simulation's accuracy. There will be real-world factors we didn't account for, like people getting sick or going on vacation. But the great part about a simulation is that anything not accounted for can be added later. It won't require a massive rework of the whole project; all we have to do is slot the change in where needed.
One of the most important parts of this project is that everything works independently and is very modular. Any irregularity we can model and build into the simulation. So many homework projects never make it out of the classroom because of the rigidity of the solution. This project is, in my mind, particularly useful because of the flexibility and ability to rework. Things change in the real world, and technology solutions should be able to adapt to remain relevant.
Final Thoughts
An incredibly huge thank you to Jeremy Harper, the friend in this project.
This project has piqued my interest in this field, and I would love to try it again with a different problem. If you have a real world problem that needs to be modeled, simulated, and optimized I would love to hear about it, and potentially collaborate on a further project in this space. My LinkedIn.
References
1. Rahul Awati, Erlang C, techtarget.com
2. Naveen Mahadevan, Service Level Agreement (2022), sprinklr.com
3. N. Jian, S. Henderson, An Introduction to Simulation Optimization (2015), Winter Simulation Conference
4. Y. Carson, A. Maria, Simulation Optimization: Methods and Applications (1997), Winter Simulation Conference