Learning How (And When) To Use List Comprehensions For Data Cleaning Using Python

Here’s what I’m thinking these days:

Beyond so many other definitions and/or descriptions, data science is a tool.

But, it’s also a toolbox filled with tools.

One of these tools is Python.

It’s one of the best tools to have in one’s data science toolbox.

But, Python is also both tool and toolbox.

And even though my data science learning journey is still just beginning, I’m realizing how important one particular tool inside the (well-equipped) Python toolbox is.

That tool is a list comprehension.

In this post, I’ll discuss what I’m learning about how (and when) to use list comprehensions for data analysis using Python.

If you’re at the beginning of a data science journey too, or if you’ve had trouble understanding list comprehensions well, hopefully this post will help.

By the end of this post, we will have covered the following:

  • What is a list comprehension?
  • A step-by-step process to go from for loop to list comprehension
  • One question to ask to determine whether a list comprehension makes sense to use
  • Using a list comprehension for the dataframe nested dictionary problem

Sound like a plan? If so, let’s get started!

(NOTE: If you’re more curious about why I’ve started this data science journey, I encourage you to read this. Also, if you haven’t done so already, I’d appreciate if you headed here to sign the petition regarding justice for Breonna Taylor, who was killed in Kentucky in March of this year, when officers entered the wrong apartment while leveraging the controversial “no knock” warrant and firing 20 shots. She was a beautiful young black woman, and an award-winning EMT).

What is a List Comprehension?

OK, let’s begin this way:

Let’s say you want to keep track of the sandwiches you make each day. Here are the first three sandwiches you made, and ate, 6 months ago:

# Creating variables for three sandwiches, and storing the
# ingredients as a list
sandwich_one = ['bread', 'vegan provolone cheese', 'tomato', 'spinach', 'red onion', 'black bean patty', 'yellow mustard']
sandwich_two = ['barbecue sauce', 'vegan provolone cheese', 'portobello mushroom', 'bread', 'pulled jack fruit']
sandwich_three = ['red onion', 'tomato', 'vegan mayonnaise', 'bread', 'spicy fake chikn patty', 'vegan mozzarella cheese', 'spinach']

You want to set things up so you can easily figure out all the details about each sandwich you’ve had, even perhaps 6 months later.

So, you’ve stored the sandwich details in lists, one by one.

Now, let’s say today you want to remember the “sandwich breakdown” for sandwich number two. You could just scroll through the hundreds of sandwich lists you’ve made (6 months’ worth of sandwiches, at, let’s say, 30 days a month, comes to 180 sandwich variables with lists), or you could use a for loop:

# Creating a for loop to check inside sandwich two for ingredients.
for i in sandwich_two:
    ingredients = i
    print(ingredients)

You could even do better by nesting this for loop inside of a function:

# Storing the for loop inside of a function for more efficiency
def check_sandwich(x):
    for i in x:
        ingredients = i
        print(ingredients)

With the initial for loop you would have to write it out each time you wanted to check on a sandwich. With the function you can now check on any sandwich you want without writing the code all over again. You can simply replace x with the specific sandwich variable and then the loop would run.

# Typing the function anytime, and inputting the variable, or
# sandwich that you're interested in.
check_sandwich(sandwich_three)

Hopefully you would agree that nesting the for loop inside of a function provides an increased level of efficiency, yes?

Well, this is one of the benefits of using a list comprehension: better efficiency.

But wait — what can a list comprehension make more efficient?

Answer: for loops!

# The list comprehension below replaces the for loop.
# We go from 3 lines of code to 1 line of code.
def check_sandwich(x):
    ingredients = [i for i in x]
    print(ingredients)

# Running the function to check on sandwich three
check_sandwich(sandwich_three)

So, in essence, a list comprehension is a more concise way to define and create lists from existing lists.

Here’s the general syntax for writing a list comprehension:

new_list = [expression(item) for item in old_list if conditional(item)]

Did you notice what was added?

I added an if statement to the list comprehension.

For loops can often include (and perhaps sometimes you’ll need) a conditional statement in order to satisfy the operation you’d like to perform.
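To make the general syntax concrete, here is a minimal sketch that reuses the sandwich idea from earlier. The names sandwich and cheesy_items are made up for illustration; the expression is item.upper() and the conditional keeps only the ingredients that contain the word 'cheese'.

```python
# A hypothetical sandwich, reusing ingredients from the examples above.
sandwich = ['bread', 'vegan provolone cheese', 'tomato', 'spinach']

# expression(item)  -> item.upper()
# conditional(item) -> 'cheese' in item
cheesy_items = [item.upper() for item in sandwich if 'cheese' in item]

print(cheesy_items)  # ['VEGAN PROVOLONE CHEESE']
```

Leaving off the if part keeps every item, and using plain item as the expression keeps each item unchanged.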

So, let’s say you have a for loop and you’d like to transform it into a list comprehension.

What steps do you take?

A step-by-step process to go from “for loop” to list comprehension

Here are the steps I’d recommend, starting from a for loop like this one:

old_list = ['cat','car','keep','corp','key','cool','kobe','coo']
new_list = []
for i in old_list:
    if len(i) > 3:
        new_list.append(i)
print(new_list)

— STEP 1: Copy & paste the new_list variable

new_list =

— STEP 2: Add output expression to beginning of list comprehension

new_list = [i]

— STEP 3: Add in the for statement (but without the ‘:’ mark):

new_list = [i for i in old_list]

— STEP 4: Now let’s add the conditional, without the ‘:’ as well:

new_list = [i for i in old_list if len(i) > 3]

And so now, we’ve gone from this:

new_list = []
for i in old_list:
    if len(i) > 3:
        new_list.append(i)
print(new_list)

To this:

new_list = [i for i in old_list if len(i) > 3]

And, essentially, this is the power of a list comprehension.
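As a quick sanity check, the two versions can be run side by side. This is only a sketch; the names loop_result and comprehension_result are introduced here so that both results can be compared.

```python
old_list = ['cat', 'car', 'keep', 'corp', 'key', 'cool', 'kobe', 'coo']

# The for loop version
loop_result = []
for i in old_list:
    if len(i) > 3:
        loop_result.append(i)

# The list comprehension version
comprehension_result = [i for i in old_list if len(i) > 3]

print(loop_result)                          # ['keep', 'corp', 'cool', 'kobe']
print(comprehension_result)                 # ['keep', 'corp', 'cool', 'kobe']
print(loop_result == comprehension_result)  # True
```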

What’s more, a list comprehension isn’t limited to lists: it can loop over any iterable, such as a string or a tuple, and the result is always a new list. Consider how valuable this becomes when dealing with a huge amount of code.
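A small illustration of that point, using only built-in Python. The variable names are made up for the example: the first comprehension loops over a string, the second over a tuple, and both give back a list.

```python
# Looping over a string visits one character at a time.
letters = [ch.upper() for ch in 'kobe']

# Looping over a tuple works the same way as looping over a list.
lengths = [len(word) for word in ('cat', 'keep', 'cool')]

print(letters)  # ['K', 'O', 'B', 'E']
print(lengths)  # [3, 4, 4]
```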

Hopefully, by now you understand what a list comprehension is.

There’s one big thing to remember, however:

Every list comprehension can be rewritten as a for loop, but not every for loop can be rewritten as a list comprehension.
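Here is one hedged example of a for loop that doesn’t translate cleanly. It stops early with break, and break is a statement that can’t appear inside a list comprehension. The variable name short_words_before_first_long is made up for this sketch.

```python
old_list = ['cat', 'car', 'keep', 'corp', 'key', 'cool', 'kobe', 'coo']

# Collect words until we reach the first word longer than 3 letters, then stop.
short_words_before_first_long = []
for word in old_list:
    if len(word) > 3:
        break  # stops the whole loop; a comprehension has no equivalent keyword
    short_words_before_first_long.append(word)

print(short_words_before_first_long)  # ['cat', 'car']
```

Note that 'key' and 'coo' are also short, but they never get checked because the loop has already ended by then.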

This is something that would require an entire post on its own I imagine (maybe it’s worth writing one?). But, if you’re curious about this, these days I’m using this site to better internalize when to use a LC and when not to.

One thing I’m learning is this:

If you don’t actually need a list for the output expression, it probably doesn’t make much sense to do a list comprehension.

So, one question to ask when considering whether to use a list comprehension is, simply:

Do I want / need to create a new list from an existing list?
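A rough way to picture that question, reusing sandwich two from earlier (vegan_items is a made-up name for this sketch): when the answer is yes, a list comprehension fits; when the goal is only to print, a plain for loop does the job without building a list nobody uses.

```python
sandwich_two = ['barbecue sauce', 'vegan provolone cheese',
                'portobello mushroom', 'bread', 'pulled jack fruit']

# Yes, a new list is wanted: keep only the ingredients that mention 'vegan'.
vegan_items = [item for item in sandwich_two if 'vegan' in item]
print(vegan_items)  # ['vegan provolone cheese']

# No new list is wanted: we only print, so a plain for loop reads better.
for item in sandwich_two:
    print(item)
```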

Having said that, let’s move on to an actual situation, with a bit more complexity, and look into how useful a list comprehension can be.

Using a list comprehension for the “dataframe nested dictionary” problem

On the Pandas documentation website, a dataframe is described as:

a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

One of the tasks that we’re sure to face when cleaning or organizing data while using Python is dealing effectively (and efficiently) with nested dictionaries in a dataframe.

So let’s look at one.

Take a look at a dataframe from a project I’m currently working on:

I’m currently analyzing movies with TMDb, using an API, to gain insights into things such as movie popularity, earnings and more.

As we can see, the data in the belongs_to_collection column is stored as a dictionary. Each row has a dictionary nested inside of the column.

So, if I wanted to create a list of only the names in the belongs_to_collection column and store these names in a variable called btc_list, what could I do?

Well, since there’s an existing list and I’d like to create a new list, let’s see if a list comprehension works well.

How would you write it?

Recall the steps:

  1. Copy & paste the new_list variable
  2. Add output expression to beginning of list comprehension
  3. Add in the for statement (but without the ‘:’ mark)
  4. Add the conditional, without the ‘:’ as well

btc_list = [i['name'] for i in practice_df['belongs_to_collection'] if i is not None]

And running this code, I got this:

Now, perhaps someone may be asking:

Why did you add the “if i is not None” at the end?

And just in case you are, I’ll explain.

Without that conditional, here’s the output I get:

What does this mean?

The best way I can explain this is that the rows that have “None” don’t actually contain “subscriptable” objects (a list, tuple, string, dictionary, etc.), and therefore Python isn’t able to look up i['name'] on them.
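The behaviour can be reproduced without the TMDb data. The sketch below uses a plain Python list, collection_column, as a hypothetical stand-in for the dataframe column (looping over a pandas column yields its values one by one in the same way), and the collection names are invented. The "is not None" check plays the same role as the conditional at the end of the one-liner.

```python
# Hypothetical stand-in for practice_df['belongs_to_collection']:
# each entry is either a dictionary or None. The names are made up.
collection_column = [
    {'id': 1, 'name': 'Example Collection A'},
    None,
    {'id': 2, 'name': 'Example Collection B'},
]

# Without the conditional, looking up None['name'] raises a TypeError.
error_message = None
try:
    names_without_check = [i['name'] for i in collection_column]
except TypeError as error:
    error_message = str(error)
    print(error_message)  # 'NoneType' object is not subscriptable

# With the conditional, the None rows are skipped.
btc_list = [i['name'] for i in collection_column if i is not None]
print(btc_list)  # ['Example Collection A', 'Example Collection B']
```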

The good thing is that the fix was quite easy: I just added the conditional to the end of a single line of code.

And, with that one line of code I was able to iterate over 3000 rows of a dataframe, and store values from within a nested dictionary into a list.

Now, imagine you had hundreds of columns like these, with nested dictionaries and some of which had “NoneType” objects.

(We’d probably need to leverage functions then, too!)
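One possible shape for such a function, as a sketch only. The helper name extract_values, the column names, and the sample data are all hypothetical, and plain lists stand in for dataframe columns. It assumes missing rows are stored as None, as in the column above.

```python
def extract_values(column, key):
    """Return row[key] for every row in column that is not None."""
    return [row[key] for row in column if row is not None]


# Hypothetical stand-ins for several dataframe columns of nested dictionaries.
columns = {
    'belongs_to_collection': [{'name': 'Example Collection A'}, None],
    'another_nested_column': [None, {'name': 'Example Value B'}],
}

# Reuse the same helper for every column instead of rewriting the comprehension.
for column_name, column in columns.items():
    print(column_name, extract_values(column, 'name'))
```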

The more I get immersed in Python and data science in general the more I understand why list comprehensions are so useful.

But a tool is only as useful as one’s ability to use it, right?

So, hopefully this has helped you!

The data science journey continues.

If there’s anything I missed or got wrong, drop a note in the comments.

Or, have a cool way to explain something I mentioned in this post?

Please do share as well.

For a really good resource on Python List Comprehensions, I’d check here!

Copywriter & Website specialist | practicing ma’at & meditation | Student of data science & decentralization
