What are DataFrames?

Have you ever wished you could wield the power of a spreadsheet within your Python code? Well, with DataFrames, you can! Forget clunky lists and arrays – DataFrames are the Swiss Army knives of data organization in Python.

Imagine a high-tech spreadsheet on steroids. That's essentially a DataFrame. It holds your data in a grid-like structure, with rows and columns just like Excel. But here's the magic: each column can hold a different kind of data, like numbers, text, or even dates. This flexibility makes DataFrames perfect for wrangling all sorts of information in your Python programs.

Think of it this way: you have a list of friends with their names, ages, and favorite hobbies. A basic list might work, but it gets messy when you want to find someone by age or hobby. With a DataFrame, each friend is a row, and their name, age, and hobby are stored in separate columns. Now, searching and sorting becomes a breeze!
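As a rough sketch of the friends example, the snippet below builds a small DataFrame from a dictionary, then filters it by hobby and sorts it by age. The names, ages, and hobbies are made-up placeholders, not data from this article.

```python
import pandas as pd

# Made-up placeholder data: one row per friend, one column per attribute.
friends = pd.DataFrame({
    "name": ["Ada", "Ben", "Chi"],
    "age": [31, 24, 28],
    "hobby": ["chess", "hiking", "chess"],
})

# Find friends by hobby using a boolean filter.
chess_players = friends[friends["hobby"] == "chess"]

# Sort everyone from youngest to oldest.
by_age = friends.sort_values("age")

print(chess_players)
print(by_age)
```

Doing the same with plain lists would mean looping over every friend by hand; here each step is a single expression.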

DataFrames are the heart of the popular Pandas library, a powerful toolbox for data manipulation in Python. With a few lines of code, you can filter data, perform calculations on entire columns, and even visualize your data in charts and graphs. It's like having a built-in data analysis team at your fingertips!

So, the next time you're working with data in Python, ditch the plain old lists and arrays. Embrace the power and flexibility of DataFrames – they'll transform the way you handle information in your code.

The Pandas Library

Pandas is a Python library designed for data analysis and manipulation. It offers a wide range of functionality, but in this article we'll focus on its CRUD (Create, Read, Update, Delete) capabilities, applied to a life expectancy dataset.

CRUD Operations:
- Create: Pandas allows you to construct new DataFrames from various data sources.
- Read: Data can be imported into DataFrames from CSV files, databases, or other formats.
- Update: Existing data within a DataFrame can be modified or corrected.
- Delete: Unnecessary data points or rows can be removed from the DataFrame.

Life Expectancy Dataset:

This article will demonstrate how to utilize Pandas' CRUD operations on a dataset containing life expectancy information. We will guide you through:

- Creating a DataFrame from the life expectancy data.
- Reading and exploring the data within the DataFrame.
- Updating specific data points if required.
- Deleting irrelevant data for a more focused analysis.

By following these steps, you'll gain a solid understanding of how to manipulate and manage data using Pandas' CRUD functions.

You can get the life expectancy dataset here

Creating a Dataframe

To create a DataFrame with the pandas library, first import the library and then call its read_csv function. This function reads a CSV file and returns a DataFrame.
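A minimal sketch of this step is shown below. So that it runs on its own, it first writes a tiny stand-in file named ‘life_expectancy.csv’ to a temporary folder; the rows are placeholders, and apart from ‘GDP’ the column names are assumptions. With the real dataset you would skip that part and pass the path of your downloaded file straight to read_csv.

```python
import pandas as pd
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    # Stand-in file with placeholder rows (not the real dataset).
    csv_path = Path(tmp) / "life_expectancy.csv"
    csv_path.write_text(
        "Country,Year,GDP\n"
        "Country A,2015,100.0\n"
        "Country B,2015,250.5\n"
    )

    # read_csv parses the file and returns a DataFrame.
    df = pd.read_csv(csv_path)

print(df)
```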

On the first line, we import the pandas library and give it the alias ‘pd’ to make it easier to work with. Next, we call the ‘read_csv’ function, which reads the contents of the specified CSV file (in our case, ‘life_expectancy.csv’) into a DataFrame. Finally, we print the contents of the stored DataFrame.

You should have something similar to the image below as the output.

Dataframes vs Arrays vs Lists

Congratulations on creating your first DataFrame! Now, let's dive deeper and explore the key differences between DataFrames, arrays, and lists in Python.

Data Structure:

DataFrames: Imagine a high-tech spreadsheet. DataFrames store data in a tabular format with rows and columns, allowing you to mix and match data types within the same structure. This makes them ideal for organizing and analyzing various kinds of information.
Arrays: Think of neatly arranged boxes on a shelf, all holding the same type of item. Arrays store collections of elements, but these elements must all be of the same data type (like integers or floats). This uniformity makes them super efficient for numerical computations.
Lists: These are like versatile backpacks – you can throw in all sorts of stuff, regardless of type. Lists allow you to store elements of different data types within a single structure, making them perfect for general-purpose data storage.
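The three structures can be compared side by side in a short sketch. It assumes "arrays" means NumPy arrays, which the article does not state explicitly, and all values are placeholders.

```python
import numpy as np
import pandas as pd

# List: any mix of types can live in one container.
mixed_list = ["Ada", 31, True]

# NumPy array: every element shares one dtype, so arithmetic
# can be applied to the whole array at once.
numbers = np.array([1, 2, 3])
doubled = numbers * 2

# DataFrame: each column keeps its own dtype.
table = pd.DataFrame({"name": ["Ada", "Ben"], "age": [31, 24]})

print([type(item).__name__ for item in mixed_list])
print(numbers.dtype, doubled)
print(table.dtypes)
```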

When to Use Each:

DataFrames: Reach for DataFrames when you're working with tabular data – think financial records, customer information, or scientific measurements. Their ability to hold mixed data types and perform data manipulation tasks makes them a natural fit for analysis.
Arrays: If you're dealing with heavy-duty numerical computations, arrays are your champion. Their optimized structure allows for blazing-fast calculations on large datasets of uniform data types.
Lists: Lists are your go-to choice for general-purpose data storage. Need to store a mix of names, numbers, and booleans? Lists have you covered. Their flexibility makes them a common tool for various programming tasks.

A Note on Flexibility:

While these guidelines provide a good starting point, remember that Python offers some flexibility. You can, for instance, convert between lists and DataFrames under certain conditions. However, understanding the strengths and weaknesses of each data structure will help you choose the most efficient tool for your specific needs.
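One way to convert between lists and DataFrames is sketched below. It assumes the list holds dictionaries that share the same keys; the records themselves are made-up placeholders.

```python
import pandas as pd

# Placeholder records: a list of dictionaries with matching keys.
records = [
    {"name": "Ada", "age": 31},
    {"name": "Ben", "age": 24},
]

# List -> DataFrame: each dictionary becomes one row.
df = pd.DataFrame(records)

# DataFrame -> list: back to a list of row dictionaries.
round_trip = df.to_dict("records")

# A single column -> plain Python list.
ages = df["age"].tolist()

print(df)
print(round_trip)
print(ages)
```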

The CRUD Operations within a Dataframe

Reading a column within a dataframe (Read Operation)

Reading a column from a DataFrame is easy. Much like looking up a key in a dictionary, you pass the name of the column as a string in square brackets next to the name of the DataFrame that contains it. Here’s how:
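A minimal sketch of the read operation follows. The DataFrame is a small stand-in with placeholder values rather than the real dataset; only the column name ‘GDP’ comes from the article.

```python
import pandas as pd

# Stand-in for the life expectancy DataFrame (placeholder values).
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "GDP": [10.0, 20.0, 30.0],
})

# Square brackets with the column name return that column as a Series.
gdp = df["GDP"]
print(gdp)
```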

In the example above, we read the contents of the column titled ‘GDP’. You should see output similar to the one below:

Creating a new column within a dataframe (Create Operation)

Creating a new column within an existing DataFrame is very similar to reading a column. The only difference is the assignment, where a value is assigned to the new column.

Let’s take a look at how we can create a new column in the DataFrame that stores a scaled value of the ‘GDP’ column we read earlier:
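A minimal sketch of the create operation, again on a small stand-in DataFrame with placeholder values:

```python
import pandas as pd

# Stand-in for the life expectancy DataFrame (placeholder values).
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "GDP": [10.0, 20.0, 30.0],
})

# Assigning to a column name that does not exist yet creates the column.
df["Scaled GDP"] = df["GDP"] ** 2
print(df["Scaled GDP"])
```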

Here we assign the squared values of the ‘GDP’ column to a new column, ‘Scaled GDP’, and then print the new column.

Your output should look something like this:

Updating an Existing Column within a Dataframe (Update Operation)

To update a column within a DataFrame, all you have to do is assign the new values to the existing column, just as you would when creating a new column. Here’s how:
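A minimal sketch of the update operation. The stand-in DataFrame and its values are placeholders, and it starts with ‘Scaled GDP’ already holding the squared values.

```python
import pandas as pd

# Stand-in DataFrame (placeholder values) with the squared column in place.
df = pd.DataFrame({"GDP": [10.0, 20.0, 30.0]})
df["Scaled GDP"] = df["GDP"] ** 2

# Assigning to an existing column name overwrites its values.
df["Scaled GDP"] = df["GDP"] * 2
print(df["Scaled GDP"])
```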

In this scenario, we update the existing ‘Scaled GDP’ column so that it holds double the values of the ‘GDP’ column instead of the squared values used earlier.

Deleting an Existing Column within a Dataframe (Delete Operation)

Great! You’ve made it to the last operation in this topic: deleting columns within a DataFrame. To make it a little more engaging, I will leave a hint to help you achieve the task. The task is as follows:

Using the del keyword, remove the ‘Scaled GDP’ column from the DataFrame you created.
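As a nudge rather than the answer, here is how del behaves on a different column. The ‘Temp’ column and the values are hypothetical; the task above follows the same pattern.

```python
import pandas as pd

# Small stand-in DataFrame with a hypothetical throwaway column.
df = pd.DataFrame({"GDP": [10.0, 20.0], "Temp": [0, 0]})

# del removes the named column from the DataFrame in place.
del df["Temp"]
print(df.columns.tolist())
```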

Conclusion

Now that you have an idea of what a DataFrame is, in the next article we will talk about how computers learn and how they convert large datasets into the intuitive outputs that shape the tech industry today. See you then!

ML For Everyone: Dataframes and Series
