Supercharged Pandas: Tracing Dependencies with a Novel Approach
An object-oriented approach to managing multiple files and dataframes, and to tracing dependencies.
How will this benefit you:
This article describes an object-oriented approach to data analysis. It outlines two novel techniques: (a) reducing repetitive file reads by assigning dataframes to attributes of a Reports object, and (b) recursively tracing the dependent methods that construct those attributes. These approaches have allowed me to be highly productive in my work, and I hope that you will reap similar benefits.
Who should read this:
- You have to analyze the same data set over a long period of time.
- You need to build reports by combining data from different sources and prepare statistics.
- You have co-workers who tend to ask you, “How did you arrive at this data?” and you cannot recall the N steps in Excel that you took to prepare the report.
- You have been using pandas for a while and you suspect that there is a more efficient way of doing things.
What’s in this article?
- Monolithic script: how it begins
- Reusable functions: how it progresses
- Objects, methods and attributes: how it evolves
- Tracing upstream dependencies: a novel approach
Preamble
It is rather difficult to explain what I am trying to do, so please bear with me if the first half of this article doesn’t make sense. I promise that towards the end, it will all be worth it.
Monolithic script: how it begins
Suppose you have three CSV files: file1.csv, file2.csv and file3.csv. You write some code to read each of them, and then merge them in a particular order.
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df3 = pd.read_csv('file3.csv')

df1_2 = pd.merge(df1, df2, on='a', how='left')
df1_2_3 = pd.merge(df1_2, df3, on='b', how='inner')
This works perfectly, and you get on with life. Next, your boss gives you file4.csv, which is supposed to be merged with file1.csv to build a separate report. No issues, you know the drill, and you update the code:
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df3 = pd.read_csv('file3.csv')
df4 = pd.read_csv('file4.csv')
df1_2 = pd.merge(df1, df2, on='a', how='left')
df1_2_3 = pd.merge(df1_2, df3, on='b', how='inner')
df1_4 = pd.merge(df1, df4, on='a', how='left')
The code runs smoothly and you get the desired output. Boss pats you on the back and jokingly says, “That’s quick, can you be even faster?”
You look up at your boss, but all you can see is a train of expletives flashing across your eyes. You fight the visceral urge to pick one to be processed by your biological audio output device. You triumph in doing so, summon all your hypocritical chakra, and fake a smile as you respond cheerfully, “Sure, let me give it a shot.”
As the train of expletives fades into the horizon and you exhaust all your chakra, you notice a glimmer of hope: there is no need to read file2.csv and file3.csv if you are only interested in df1_4. It dawns on you that this flagrant expenditure of precious time and computing power contradicts your commitment to sustainability, and you begin to contemplate how to make the code more efficient by reading only what is necessary.
Reusable functions: how it progresses
You recall the programming classes that you took N years ago and proceed to write some functions:
files = {1: 'file1.csv', 2: 'file2.csv', 3: 'file3.csv', 4: 'file4.csv'}

def get_df(x):
    return pd.read_csv(files[x])

def get_df1_2():
    df1 = get_df(1)
    df2 = get_df(2)
    return pd.merge(df1, df2, on='a', how='left')

def get_df1_2_3():
    df1_2 = get_df1_2()
    df3 = get_df(3)
    return pd.merge(df1_2, df3, on='b', how='inner')

def get_df1_4():
    df1 = get_df(1)
    df4 = get_df(4)
    return pd.merge(df1, df4, on='a', how='left')
You are pleased with yourself. Although the number of lines of code has more than doubled, you take comfort in the fact that it will be more manageable in the long run. You also justify this approach because you can get specific output dataframes, and each of them will read only the required tables and nothing else. Then you feel a chill down your spine, as an inner voice challenges your conscious thoughts. “Are you sure?” he barks in a commanding tone, reminiscent of a drill sergeant. Silence hangs densely in the air, and all you can hear is the spinning of the imaginary cogs in your mind… Suddenly, your eyes light up: if you need both df1_2 and df1_4, then file1.csv will be read twice! Roar!
Objects, methods and attributes: how it evolves
Once again, you recall the programming lessons in college and remember that you can resolve this by creating a Reports object. After a dataframe has been read, it can be set as an attribute of the Reports object so that it can be accessed later.
files = {1: 'file1.csv', 2: 'file2.csv', 3: 'file3.csv', 4: 'file4.csv'}

class Reports:
    def __init__(self):
        self.df1 = pd.read_csv(files[1])

    def get_df1_2(self):
        self.df2 = pd.read_csv(files[2])
        self.df1_2 = pd.merge(self.df1, self.df2, on='a', how='left')
        return self.df1_2

    def get_df1_4(self):
        self.df4 = pd.read_csv(files[4])
        self.df1_4 = pd.merge(self.df1, self.df4, on='a', how='left')
        return self.df1_4

    def get_df1_2_3(self):
        self.get_df1_2()
        self.df3 = pd.read_csv(files[3])
        self.df1_2_3 = pd.merge(self.df1_2, self.df3, on='b', how='inner')
        return self.df1_2_3
Voila! You have solved the problem of reading the same file several times. But there is yet another problem: get_df1_2_3 can become very complicated if it has to go through many steps, e.g. filtering, selecting, boolean-masking, removing duplicates, and so on.
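To see how quickly this can happen, here is a hypothetical version of get_df1_2_3 with a few such clean-up steps bolted on (the status and value columns and the filter are made up purely for illustration):

    def get_df1_2_3(self):
        self.get_df1_2()
        self.df3 = pd.read_csv(files[3])
        # Hypothetical clean-up steps before the merge:
        df3 = self.df3[self.df3['status'] == 'active']  # boolean mask on a made-up column
        df3 = df3[['b', 'status', 'value']]             # keep only the columns needed
        df3 = df3.drop_duplicates(subset='b')           # drop duplicate join keys
        self.df1_2_3 = pd.merge(self.df1_2, df3, on='b', how='inner')
        return self.df1_2_3

Every extra step like this makes the method longer, and the dependencies between dataframes harder to see at a glance.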
Tracing upstream dependencies: a novel approach
You take a deep breath and wonder… is there a way for the code to figure out that if self.df1_2 has not been set, then it should call self.get_df1_2()? More generally, when an attribute being accessed is not present, can we identify the method responsible for setting it, and then call that method? If this can be achieved, then one can simply write x = Reports(); x.df1_2_3 to get the required dataframe in one command.
Isn’t that worth fighting for? Isn’t that worth dying for? — Morpheus, The Matrix Reloaded, 2003
Like a mad scientist at work, you begin hammering away at your keyboard, occasionally looking up to make imaginary drawings of programming abstractions and connecting them with your fingers. In your peripheral vision, you notice a look of bewilderment (or perhaps disgust, you cannot tell) from a co-worker you never knew. You channel all your focus to enter a flow state, oblivious to what is happening around you. The building could have caught fire, but you wouldn’t know as long as your trusty Notepad++ continues to display every key you enter.
files = {'df1': 'file1.csv', 'df2': 'file2.csv',
         'df3': 'file3.csv', 'df4': 'file4.csv'}

class Reports:
    def __init__(self):
        self._build_shortcuts()

    def _read(self, k):
        # Read the file mapped to attribute name `k` and set it as an attribute
        setattr(self, k, pd.read_csv(files[k]))

    def _build_shortcuts(self):
        # Dict: method -> list of attributes the method sets
        dict0 = {'get_df1_2': ['df1_2'],
                 'get_df1_4': ['df1_4'],
                 'get_df1_2_3': ['df1_2_3']}
        # Dict: attribute -> method which creates the attribute
        dict1 = {v: k for k, values in dict0.items() for v in values}
        self._shortcuts = dict1

    def __getattr__(self, attr):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # i.e. when `attr` has not been set yet.
        if attr in self._shortcuts:
            func = self._shortcuts[attr]
            # `func` is the method responsible for creating `attr`
            self.__getattribute__(func)()
            return self.__getattribute__(attr)
        elif attr in files:
            self._read(attr)
            return self.__getattribute__(attr)
        else:
            raise AttributeError(attr)

    def get_df1_2(self):
        self.df1_2 = pd.merge(self.df1, self.df2, on='a', how='left')
        return self.df1_2

    def get_df1_4(self):
        self.df1_4 = pd.merge(self.df1, self.df4, on='a', how='left')
        return self.df1_4

    def get_df1_2_3(self):
        self.df1_2_3 = pd.merge(self.df1_2, self.df3, on='b', how='inner')
        return self.df1_2_3
You take a moment to admire your creation, its elegance and simplicity. For a split second, you dream about how this will benefit coders and data analysts. As you ride the hot-air balloon of euphoria, the inner voice descends upon you like shackles on a prisoner. “Stay grounded,” he says, “as you may not be the first to come up with such an idea.” You buckle down and begin documenting your work, consciously aware that you may not understand what you have written a few days later.
- __init__() does not read any files; it merely calls _build_shortcuts().
- _build_shortcuts() & __getattr__ work hand-in-hand to simplify the code in subsequent methods.
- _build_shortcuts() takes a dictionary with methods as keys and lists of attributes as values, then inverts it to form a dictionary with attributes as keys and the methods that set them as values (see the example after this list).
- __getattr__ does quite a bit of magic. Python invokes it only when normal attribute lookup fails, i.e. when self.<attr> has not been set. If attr is a key of self._shortcuts, it identifies the method responsible for creating self.<attr> and calls that method. If attr is instead a key of the files dictionary, it reads the corresponding file and sets the result as an attribute of the Reports object. Whenever you write a new method that sets a new attribute, all you have to do is update dict0 in _build_shortcuts().
- Without explicitly writing a loop or recursion, __getattr__ and self._shortcuts work together to trace the upstream dependencies!
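To make the inversion concrete, here is what the two dictionaries in _build_shortcuts() evaluate to for the three methods above:

dict0 = {'get_df1_2': ['df1_2'],
         'get_df1_4': ['df1_4'],
         'get_df1_2_3': ['df1_2_3']}
dict1 = {v: k for k, values in dict0.items() for v in values}
# dict1 == {'df1_2': 'get_df1_2',
#           'df1_4': 'get_df1_4',
#           'df1_2_3': 'get_df1_2_3'}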
For now, this is a superior approach for the following reasons:
- Files are read only when absolutely required, so minimal time is wasted.
- When files are read, they are read only once, and the data is written to an attribute.
- When an attribute that has not yet been created is accessed, the object finds the method responsible for setting it and calls that method.
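Putting it all together, a short session might look like this (a minimal sketch, assuming the four CSV files exist with the columns used above):

x = Reports()   # instantiating reads no files at all
x.df1_4         # reads file1.csv and file4.csv, then builds and returns df1_4
x.df1_2_3       # reads file2.csv and file3.csv; file1.csv is reused, not re-read
x.df1_4         # already set, so it is returned immediately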
Additional benefit
Besides being able to access the desired dataframes in one command, you can also add other attributes[1] to the values of dict0 in _build_shortcuts().
For example, you may be interested in the list of unique values of column a in df1_2. Simply add it to the list, and you can use x = Reports(); x.unique_values_in_a.
...
    def _build_shortcuts(self):
        # Added 'unique_values_in_a' to the first list.
        dict0 = {'get_df1_2': ['df1_2', 'unique_values_in_a'],
                 'get_df1_4': ['df1_4'],
                 'get_df1_2_3': ['df1_2_3']}
        dict1 = {v: k for k, values in dict0.items() for v in values}
        self._shortcuts = dict1
...
    def get_df1_2(self):
        self.df1_2 = pd.merge(self.df1, self.df2, on='a', how='left')
        # Added the list of unique values of column 'a'
        self.unique_values_in_a = self.df1_2['a'].unique().tolist()
        return self.df1_2
Conclusion
What does it mean for you?
I highly encourage you to try this approach the next time you are required to analyze data involving multiple dataframes.
- For Python novices: you can simply copy and paste the Reports class with its __init__, __getattr__ and _build_shortcuts methods. Obviously, you will need to write your own methods and update dict0 in _build_shortcuts; a minimal skeleton is sketched after this list.
- For Python experts: I would love to hear your views on my approach, and whether you are doing something similar, or better.
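If you want a starting point, here is a minimal, self-contained sketch of that copy-and-paste core (the raw.csv file name, the raw attribute and the get_report method are placeholders to replace with your own files and logic):

import pandas as pd

files = {'raw': 'raw.csv'}  # attribute name -> file path; edit to suit

class Reports:
    def __init__(self):
        self._build_shortcuts()

    def _read(self, k):
        setattr(self, k, pd.read_csv(files[k]))

    def _build_shortcuts(self):
        # Method -> list of attributes it sets; edit to suit
        dict0 = {'get_report': ['report']}
        self._shortcuts = {v: k for k, values in dict0.items() for v in values}

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails
        if attr in self._shortcuts:
            self.__getattribute__(self._shortcuts[attr])()
            return self.__getattribute__(attr)
        elif attr in files:
            self._read(attr)
            return self.__getattribute__(attr)
        raise AttributeError(attr)

    def get_report(self):
        # Replace with your own logic; accessing self.raw triggers the read lazily
        self.report = self.raw.describe()
        return self.report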
Disclaimer
This narrative is merely for illustrative purposes and does not in any way, shape or form represent my firm’s views or my own, or reflect any experiences at my firm or with my clients.
This is the first time that I’ve used such a writing style; if you like it, do show your appreciation by clapping, following and subscribing. Thank you!
[1] Many thanks to Tongwei for proofreading and suggesting this.