Data Science in Drilling

Data Accessing and Sorting in Pandas

written by Zeyu Yan, Ph.D., Head of Data Science from Nvicta AI

Data Science in Drilling is a multi-episode series written by the technical team members at Nvicta AI. Nvicta AI is a startup that helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this series is to give both data engineers and drilling engineers insight into state-of-the-art techniques that combine drilling engineering and data science.

This is another Pandas episode. Enjoy!

Enjoying great knowledge is just like enjoying delicious chocolate puppy.

Introduction

Accessing and sorting data are two more important topics in Pandas. This blog post covers both in depth.

What We'll Cover Today

Different ways of accessing data in Pandas Series/DataFrames.
Different ways of sorting data in Pandas Series/DataFrames.

Data Accessing in Pandas

Let's first talk about data accessing in Pandas Series. Define a dummy Series for testing as follows:

import numpy as np
import pandas as pd

# Five random integers in [0, 10), labelled by player name
series_1 = pd.Series(
    np.random.randint(10, size=5),
    index=['Kobe', 'Lebron', 'MJ', 'Kevin', 'Jason']
)
series_1

Because the values are random, the output differs from run to run. In this run the dummy Series is:

Kobe      2
Lebron    9
MJ        3
Kevin     4
Jason     0
dtype: int64

The data of a Series can be accessed through its index labels, for example:

series_1['Lebron']

The result is:

9

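Integer positions can also be used for data accessing. Passing a bare integer to [] on a Series with string labels is deprecated in recent Pandas versions, so the explicit iloc indexer is used here:

series_1.iloc[3]

The result is:

4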

Consecutive elements can be accessed with a positional slice:

series_1[:3]

The resulting Series is:

Kobe      2
Lebron    9
MJ        3
dtype: int64

Consecutive elements can also be accessed by slicing with index labels. Unlike a positional slice, a label slice includes its end label:

series_1['Lebron':'Jason']

The resulting Series is:

Lebron    9
MJ        3
Kevin     4
Jason     0
dtype: int64
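The difference between the two slicing styles is easy to miss, so here is a minimal sketch of it. The Series demo and its fixed values are assumptions made for illustration (the article's Series is random), and the explicit iloc and loc indexers are used to make the intent of each slice unambiguous.

```python
import pandas as pd

# Hypothetical demo Series with fixed values so the output is reproducible.
demo = pd.Series(
    [2, 9, 3, 4, 0],
    index=['Kobe', 'Lebron', 'MJ', 'Kevin', 'Jason']
)

# Positional slice: the stop position (4) is excluded.
by_position = demo.iloc[1:4]

# Label slice: the stop label ('Kevin') is included.
by_label = demo.loc['Lebron':'Kevin']

print(by_position.tolist())  # [9, 3, 4]
print(by_label.tolist())     # [9, 3, 4]
```

Both slices return the Lebron, MJ, and Kevin rows, even though the positional slice names position 4 as its stop and the label slice names 'Kevin'.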

To access several specific elements by label, pass a list of labels, for example:

series_1[['Kobe', 'MJ']]

The resulting Series is:

Kobe    2
MJ      3
dtype: int64

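Specific elements can also be selected by integer position. Because a list of integers passed to [] on a string-labelled Series is deprecated in recent Pandas versions, the iloc indexer is used here:

series_1.iloc[[1, 3]]

The resulting Series is: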

Lebron    9
Kevin     4
dtype: int64
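The access patterns above mix labels and positions. As a hedged sketch, the snippet below spells out each one with the explicit loc (label) and iloc (position) indexers, which behave the same across Pandas versions. The Series players and its values are made up for the example.

```python
import pandas as pd

# Hypothetical Series with fixed values (not the article's random data).
players = pd.Series(
    [2, 9, 3, 4, 0],
    index=['Kobe', 'Lebron', 'MJ', 'Kevin', 'Jason']
)

lebron = players.loc['Lebron']      # single label -> scalar
fourth = players.iloc[3]            # single position -> scalar
by_labels = players.loc[['Kobe', 'MJ']]   # list of labels -> Series
by_positions = players.iloc[[1, 3]]       # list of positions -> Series

print(lebron, fourth)               # 9 4
print(by_labels.tolist())           # [2, 3]
print(by_positions.index.tolist())  # ['Lebron', 'Kevin']
```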

To keep only the values in the Series that are less than or equal to 3, index the Series with a boolean condition:

series_1[series_1 <= 3]

The resulting Series is:

Kobe     2
MJ       3
Jason    0
dtype: int64
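Boolean conditions can also be combined. The following is a small sketch, assuming a Series named scores with fixed values chosen for illustration: conditions are joined with & (and), | (or), and ~ (not), and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical Series with fixed values for a reproducible result.
scores = pd.Series(
    [2, 9, 3, 4, 0],
    index=['Kobe', 'Lebron', 'MJ', 'Kevin', 'Jason']
)

# Values between 2 and 4 inclusive.
in_range = scores[(scores >= 2) & (scores <= 4)]

# Everything else: negate the same mask with ~.
out_of_range = scores[~((scores >= 2) & (scores <= 4))]

print(in_range.index.tolist())      # ['Kobe', 'MJ', 'Kevin']
print(out_of_range.index.tolist())  # ['Lebron', 'Jason']
```

Python's own and, or, and not keywords do not work element-wise on a Series, which is why the symbol operators are used.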

This is how we can set the values of the Series through a label slice:

series_1['Lebron':'Kevin'] = 100
series_1

The resulting Series is:

Kobe        2
Lebron    100
MJ        100
Kevin     100
Jason       0
dtype: int64
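Assignment also works with a boolean condition instead of a label slice. The sketch below is an illustration under assumed data: the names values and capped are hypothetical, and copy() is used only so that the original Series stays available for comparison.

```python
import pandas as pd

# Hypothetical Series with fixed values.
values = pd.Series(
    [2, 9, 3, 4, 0],
    index=['Kobe', 'Lebron', 'MJ', 'Kevin', 'Jason']
)

# Work on a copy so the original Series is left untouched.
capped = values.copy()

# Overwrite every element greater than 3 with 3.
capped.loc[capped > 3] = 3

print(values.tolist())  # [2, 9, 3, 4, 0]
print(capped.tolist())  # [2, 3, 3, 3, 0]
```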

Data accessing patterns are similar for DataFrames. Define a dummy DataFrame for testing as follows:

df_1 = pd.DataFrame(
    np.random.randint(10, size=16).reshape(4, 4),
    index=['Michael', 'Kobe', 'Lebron', 'Kevin'],
    columns=['a', 'b', 'c', 'd']
)
df_1

The dummy DataFrame is:

First, let's access a column of data by its column name:

df_1['c']

The resulting Series is:

Michael    6
Kobe       4
Lebron     8
Kevin      2
Name: c, dtype: int64
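Whether a single name or a list of names is passed to [] decides the type of the result. Here is a minimal sketch, using a hypothetical DataFrame frame with made-up values:

```python
import pandas as pd

# Hypothetical DataFrame with fixed values for illustration.
frame = pd.DataFrame(
    {'a': [0, 5, 7, 1], 'c': [6, 4, 8, 2]},
    index=['Michael', 'Kobe', 'Lebron', 'Kevin']
)

as_series = frame['c']    # a single name returns a Series
as_frame = frame[['c']]   # a one-element list returns a DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (4, 1)
```

The list form is useful when later code expects a two-dimensional object even for a single column.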

If data from multiple columns is needed, a list of column names needs to be provided, for example:

df_1[['a', 'd']]

The resulting DataFrame is:

Rows of the DataFrame can be accessed with a positional slice. This is how we can access the first two rows of the DataFrame:

df_1[:2]

The resulting DataFrame is:

A condition can be applied to a specific column to filter rows. This is how to return only the rows whose value in column a is greater than or equal to 4:

df_1[df_1['a'] >= 4]

The resulting DataFrame is:

A condition can also be applied to the whole DataFrame. This is how to set every value in the DataFrame that is greater than or equal to 4 to 100:

df_1[df_1 >= 4] = 100
df_1

The resulting DataFrame is:

The iloc and loc indexers are two important tools for accessing data in a DataFrame. The former selects rows and columns by integer position, while the latter selects them by label. Let's first access the 3rd row of the DataFrame using iloc:

df_1.iloc[2]

The resulting Series is:

a    100
b      3
c    100
d    100
Name: Lebron, dtype: int64

This is how to access the value of a specific cell located at row 3, column 3:

df_1.iloc[2, 2]

The result is:

100

Now we want the data from row 3, columns 2 and 4:

df_1.iloc[2, [1, 3]]

The resulting Series is:

b      3
d    100
Name: Lebron, dtype: int64
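Since df_1 is random, the iloc calls above can be hard to check by eye. The sketch below repeats them on a hypothetical DataFrame grid filled with the numbers 0 to 15, so each result can be verified against the position it came from.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: row i holds the values 4*i .. 4*i + 3.
grid = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['Michael', 'Kobe', 'Lebron', 'Kevin'],
    columns=['a', 'b', 'c', 'd']
)

row = grid.iloc[2]               # 3rd row as a Series
cell = grid.iloc[2, 2]           # 3rd row, 3rd column as a scalar
row_part = grid.iloc[2, [1, 3]]  # 3rd row, 2nd and 4th columns

print(row.tolist())       # [8, 9, 10, 11]
print(cell)               # 10
print(row_part.tolist())  # [9, 11]
```

Positions are zero-based, so "row 3, column 3" in the text corresponds to iloc[2, 2].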

This is how the data from rows 1 and 4, columns 2 and 4 can be retrieved:

df_1.iloc[[0, 3], [1, 3]]

The resulting DataFrame is:

We can also access the data from the first 2 consecutive rows and the first 3 consecutive columns:

df_1.iloc[:2, :3]

The resulting DataFrame is:

Now let's take a look at the loc indexer. This is how we can access a row of data through its index label:

df_1.loc['Michael']

The resulting Series is:

a      0
b      0
c    100
d    100
Name: Michael, dtype: int64

This is how to access the value of a specific cell from the Michael row and the c column:

df_1.loc['Michael', 'c']

The result is:

100

We can also access the data from the Michael row and both the a and c columns:

df_1.loc['Michael', ['a', 'c']]

The resulting Series is:

a      0
c    100
Name: Michael, dtype: int64
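The loc indexer also accepts a boolean condition in the row position, which combines filtering and column selection in one step. This is a hedged sketch on a hypothetical DataFrame grid filled with the numbers 0 to 15, not the article's random df_1.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: row i holds the values 4*i .. 4*i + 3.
grid = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['Michael', 'Kobe', 'Lebron', 'Kevin'],
    columns=['a', 'b', 'c', 'd']
)

cell = grid.loc['Michael', 'c']               # scalar
row_part = grid.loc['Michael', ['a', 'c']]    # Series

# Rows where column a is at least 8, restricted to columns a and c.
filtered = grid.loc[grid['a'] >= 8, ['a', 'c']]

print(cell)                      # 2
print(row_part.tolist())         # [0, 2]
print(filtered.index.tolist())   # ['Lebron', 'Kevin']
print(filtered.values.tolist())  # [[8, 10], [12, 14]]
```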

Now let's access the data from both the Michael and the Kevin rows, and both the a and c columns:

df_1.loc[['Michael', 'Kevin'], ['a', 'c']]

The resulting DataFrame is:

Consecutive rows and columns can also be selected with label slices, which include their end labels:

df_1.loc[:'Lebron', :'c']

The resulting DataFrame is:

Data Sorting in Pandas

Let's first talk about data sorting in Series. Define a dummy Series for testing as follows:

series_2 = pd.Series(
    np.arange(1, 6),
    index=['e', 'a', 'c', 'd', 'b']
)
series_2

The dummy Series is:

e    1
a    2
c    3
d    4
b    5
dtype: int64

The Series can be sorted by its index:

series_2.sort_index()

The resulting Series is:

a    2
b    5
c    3
d    4
e    1
dtype: int64

Note that the above operation returns a new Series rather than overwriting the original one. To overwrite the original Series, the inplace=True option needs to be used:

series_2.sort_index(inplace=True)
series_2

series_2 becomes:

a    2
b    5
c    3
d    4
e    1
dtype: int64
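The difference between the two calls can be checked directly. In this small sketch, the Series letters is a stand-in built the same way as series_2, and the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in Series built like series_2 in the text.
letters = pd.Series(
    np.arange(1, 6),
    index=['e', 'a', 'c', 'd', 'b']
)

# Without inplace, a sorted copy is returned and the original is unchanged.
sorted_copy = letters.sort_index()
print(letters.index.tolist())      # ['e', 'a', 'c', 'd', 'b']
print(sorted_copy.index.tolist())  # ['a', 'b', 'c', 'd', 'e']

# With inplace=True, the original is modified and the call returns None.
returned = letters.sort_index(inplace=True)
print(returned)                    # None
print(letters.tolist())            # [2, 5, 3, 4, 1]
```

Because the in-place call returns None, writing letters = letters.sort_index(inplace=True) would discard the data.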

A Series can also be sorted by its values:

series_2.sort_values()

The resulting Series is:

e    1
a    2
c    3
d    4
b    5
dtype: int64

By default, sorting is in ascending order. To sort the values in descending order, the ascending=False option needs to be used:

series_2.sort_values(ascending=False)

The resulting Series is:

b    5
d    4
c    3
a    2
e    1
dtype: int64

A natural question is how sorting in Pandas handles NaNs when the Series contains them. To test this, set one of the values in the Series to NaN:

series_2['a'] = np.nan
series_2

The Series becomes:

a    NaN
b    5.0
c    3.0
d    4.0
e    1.0
dtype: float64

First, sort the Series by its values in ascending order:

series_2.sort_values()

The resulting Series is:

e    1.0
c    3.0
d    4.0
b    5.0
a    NaN
dtype: float64

Then sort the Series by its values in descending order:

series_2.sort_values(ascending=False)

The resulting Series is:

b    5.0
d    4.0
c    3.0
e    1.0
a    NaN
dtype: float64

It can be seen that whether the Series is sorted by value in ascending or descending order, the NaNs are placed at the tail of the sorted Series by default. Note also that sort_values returns a new Series rather than overwriting the original one. To overwrite the original Series, the inplace=True option needs to be used here as well.
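The tail placement is a default rather than a fixed rule: sort_values takes a na_position argument that accepts 'first' or 'last'. Here is a minimal sketch on a hypothetical Series with_gap that mirrors the values used above.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
with_gap = pd.Series(
    [np.nan, 5.0, 3.0, 4.0, 1.0],
    index=['a', 'b', 'c', 'd', 'e']
)

# Default: NaN goes last.
nan_last = with_gap.sort_values()

# na_position='first' moves NaN to the head instead.
nan_first = with_gap.sort_values(na_position='first')

print(nan_last.index.tolist())   # ['e', 'c', 'd', 'b', 'a']
print(nan_first.index.tolist())  # ['a', 'e', 'c', 'd', 'b']
```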

To demonstrate how to sort the values of a DataFrame, define another dummy DataFrame as follows:

df_2 = pd.DataFrame(
    np.random.randint(10, size=16).reshape(4, 4),
    index=['gamma', 'beta', 'alpha', 'delta'],
    columns=['d', 'a', 'c', 'b']
)
df_2

The dummy DataFrame is:

The rows of the DataFrame can be sorted by their index labels:

df_2.sort_index(axis=0)

The resulting DataFrame is:

The columns of the DataFrame can also be sorted by their names:

df_2.sort_index(axis=1)

The resulting DataFrame is:

This is how to sort the columns of the DataFrame by their names in descending order:

df_2.sort_index(axis=1, ascending=False)

The resulting DataFrame is:

The inplace=True option can also be used to overwrite the original DataFrame:

df_2.sort_index(axis=0, inplace=True)
df_2.sort_index(axis=1, inplace=True)
df_2

df_2 becomes:

This is how to sort the rows of the DataFrame by the values of column b:

df_2.sort_values(by='b')

The resulting DataFrame is:

We can also sort the rows of the DataFrame by the values of column c in descending order:

df_2.sort_values(by='c', ascending=False)

The resulting DataFrame is:

To demonstrate how to sort the rows of the DataFrame by multiple columns, let's first rewrite the value of one of its cells:

df_2.loc['beta', 'c'] = 4
df_2

df_2 becomes:

Let's sort the rows of the DataFrame by column c in descending order and, within equal values of c, by column d in ascending order:

df_2.sort_values(by=['c', 'd'], ascending=[False, True])

The resulting DataFrame is:

When there is a tie in column c, the tied rows are ordered by column d in ascending order.
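Because df_2 is random, the tie-breaking can be easier to follow on fixed data. The sketch below uses a hypothetical DataFrame table in which two rows share the same value in column c.

```python
import pandas as pd

# Hypothetical DataFrame: alpha and delta tie on column c.
table = pd.DataFrame(
    {'c': [4, 7, 4, 1], 'd': [9, 2, 3, 5]},
    index=['alpha', 'beta', 'delta', 'gamma']
)

# c descending first; ties on c are broken by d ascending.
ordered = table.sort_values(by=['c', 'd'], ascending=[False, True])

print(ordered.index.tolist())  # ['beta', 'delta', 'alpha', 'gamma']
print(ordered['d'].tolist())   # [2, 3, 9, 5]
```

Here beta comes first with c = 7, then the two rows with c = 4 appear with delta (d = 3) ahead of alpha (d = 9), and gamma comes last.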

Lastly, just like the sort_values method on Series, the sort_values method on DataFrames returns a new DataFrame rather than overwriting the original one. To overwrite the original DataFrame, the inplace=True option needs to be used:

df_2.sort_values(by=['c', 'd'], ascending=[False, True], inplace=True)
df_2

Now df_2 becomes:

Conclusions

In this article, we went through how to access and sort data in Pandas Series and DataFrames. We hope you enjoyed it! More Pandas skills will be covered in future episodes. Stay tuned!

Get in Touch

Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share it with your friends.

If you or your company needs any help with projects related to drilling automation and optimization, AI, and data science, please get in touch with us at Nvicta AI. We are here to help. Cheers!

Data Science in Drilling - Episode 9
