Pandas Determine if Column Categorical or Continuous
Are you wondering how to get all of the continuous columns in your Pandas DataFrame? Or maybe you are interested in selecting all of the categorical columns in your dataset? Well either way – you are in the right place! In this article we demonstrate how to separate continuous and categorical columns using Pandas.
To demonstrate how to separate continuous and categorical columns in Pandas, we need a toy dataset. Throughout this article, we will use the following DataFrame in our examples.
import pandas as pd

data = {
    'float_1': [0.1, 0.2, 0.3],
    'float_2': [0.4, 0.5, 0.6],
    'int_1': [1, 2, 3],
    'int_2': [4, 5, 6],
    'string_1': ['one', 'two', 'three'],
    'string_2': ['four', 'five', 'six'],
    'repeated_string_1': ['one', 'one', 'two'],
    'repeated_string_2': ['four', 'four', 'five'],
}
df = pd.DataFrame(data)
Getting continuous columns with Pandas
Why would you want to get continuous columns in Pandas?
Before we talk about how to get continuous columns in Pandas, we will first discuss some reasons why you might want to get the continuous columns in a Pandas DataFrame.
- Normalize continuous columns. The first reason you might want to extract the continuous columns from a Pandas DataFrame is if you need to normalize your data. There are many methods that require numeric data to be normalized before the method can be applied. This is particularly common for methods that are affected by the scale of a variable and therefore perform best when all of the numeric data is on the same scale.
- Impute missing data. You may also want to separate your continuous and categorical data when you impute missing data. The methods that are used to impute data for continuous and categorical data can differ. As a very simple example, you might want to use the median to fill in missing values for continuous data and the mode to fill in missing data for categorical data.
- Calculate summary statistics. Finally, you might want to separate continuous data from categorical data if you want to compute specific summary statistics for your continuous data. You will generally want to compute different summaries for your continuous and categorical data.
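The three use cases above can be sketched in a few lines. This is only an illustration under our own assumptions: the missing value, the choice of min-max scaling, and the variable names are not part of the toy dataset.

```python
import pandas as pd

# Small frame mirroring the toy data, with one missing value added
# for illustration (the missing value is an assumption).
df = pd.DataFrame({
    'float_1': [0.1, None, 0.3],
    'int_1': [1, 2, 3],
    'string_1': ['one', 'two', 'three'],
})

# Names of the numeric columns, which we treat as continuous here.
continuous_cols = df.select_dtypes(include='number').columns

# Impute: fill missing values with each column's median.
df[continuous_cols] = df[continuous_cols].fillna(df[continuous_cols].median())

# Normalize: min-max scale every continuous column to the range [0, 1].
col_min = df[continuous_cols].min()
col_max = df[continuous_cols].max()
scaled = (df[continuous_cols] - col_min) / (col_max - col_min)

# Summarize: describe() reports count, mean, std, min, quartiles, and max.
summary = scaled.describe()
print(summary)
```

Min-max scaling is just one possible normalization. Standardizing to zero mean and unit variance would follow the same pattern.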
How to get continuous columns in Pandas
Now that we have discussed some reasons why one might want to select all of the continuous columns in a Pandas DataFrame, we will show you how to select all of the continuous columns. So what is the easiest way to get all of the continuous columns in a Pandas DataFrame?
We recommend using the select_dtypes method on your Pandas DataFrame. This method has two arguments, include and exclude, that control which data types are kept in the DataFrame it returns. Note that this approach treats every numeric column, integer or float, as continuous. Here is an example of how to use the select_dtypes method to select numeric columns.
df.select_dtypes(include='number')
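To make the result concrete, here is a self-contained sketch that rebuilds the toy DataFrame and prints which columns survive the filter. The name continuous_df is our own.

```python
import pandas as pd

df = pd.DataFrame({
    'float_1': [0.1, 0.2, 0.3],
    'float_2': [0.4, 0.5, 0.6],
    'int_1': [1, 2, 3],
    'int_2': [4, 5, 6],
    'string_1': ['one', 'two', 'three'],
    'string_2': ['four', 'five', 'six'],
    'repeated_string_1': ['one', 'one', 'two'],
    'repeated_string_2': ['four', 'four', 'five'],
})

# Keep only the numeric columns; the original column order is preserved.
continuous_df = df.select_dtypes(include='number')
print(list(continuous_df.columns))
# ['float_1', 'float_2', 'int_1', 'int_2']
```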
Getting categorical columns with Pandas
Why would you want to get categorical columns in Pandas?
Before we talk about how to select categorical columns in a Pandas DataFrame, we will first discuss some reasons why you might want to select only the categorical columns in a Pandas DataFrame.
- Group long tail categories together. One transformation that you would want to apply to categorical variables but not continuous variables is grouping long tail categories together. Long tail categories are sparse categories that do not have many observations belonging to them. If there is not enough data to draw meaningful inferences about these long tail categories, it often makes sense to group them together into an 'other' category.
- One hot encode variables. Depending on what method you will be applying to your cleaned data, you might need to one hot encode your data so that all of your categorical variables are represented by a series of binary variables. This is a transformation that you would only want to apply to your categorical variables.
- Impute missing data. As we mentioned before, the methods that are used to impute missing data often differ for categorical and continuous data.
- Calculate summary statistics. Again, the summary statistics that you want to calculate will often differ between categorical and continuous variables.
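The transformations listed above might look roughly like the sketch below. The data, the column names, and the threshold of 2 are hypothetical choices made for illustration, and pd.get_dummies is only one of several ways to one hot encode.

```python
import pandas as pd

# Hypothetical data: 'color' has two rare categories and one missing value.
df = pd.DataFrame({
    'color': ['red', 'red', 'red', 'blue', 'blue', 'green', 'teal', None],
    'size': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Names of the non-numeric columns, which we treat as categorical here.
categorical_cols = df.select_dtypes(exclude='number').columns

min_count = 2  # arbitrary threshold for what counts as a long tail category

for col in categorical_cols:
    # Impute: fill missing values with the most frequent value (the mode).
    df[col] = df[col].fillna(df[col].mode()[0])

    # Group the long tail: relabel categories seen fewer than min_count times.
    counts = df[col].value_counts()
    rare = counts[counts < min_count].index
    df[col] = df[col].where(~df[col].isin(rare), 'other')

# Summarize: frequency counts are a natural summary for categorical data.
print(df['color'].value_counts())

# One hot encode only the categorical columns; numeric columns pass through.
encoded = pd.get_dummies(df, columns=list(categorical_cols))
print(list(encoded.columns))
```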
How to get categorical columns in Pandas
Now that we have talked about why you would want to extract the categorical columns from your Pandas DataFrame, we will discuss how to extract them. We recommend doing this using the same select_dtypes method that we used to select continuous columns. However, this time we will use the exclude argument rather than the include argument so that we can exclude continuous columns.
df.select_dtypes(exclude='number')
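Here is a self-contained sketch of what this returns for the toy DataFrame. It also shows one caveat: exclude='number' keeps every non-numeric dtype, not only strings. The is_active column is a hypothetical addition used to show this.

```python
import pandas as pd

df = pd.DataFrame({
    'float_1': [0.1, 0.2, 0.3],
    'float_2': [0.4, 0.5, 0.6],
    'int_1': [1, 2, 3],
    'int_2': [4, 5, 6],
    'string_1': ['one', 'two', 'three'],
    'string_2': ['four', 'five', 'six'],
    'repeated_string_1': ['one', 'one', 'two'],
    'repeated_string_2': ['four', 'four', 'five'],
})

# Drop the numeric columns; everything else is kept.
categorical_df = df.select_dtypes(exclude='number')
print(list(categorical_df.columns))
# ['string_1', 'string_2', 'repeated_string_1', 'repeated_string_2']

# A boolean column is not numeric, so it is kept as well.
# 'is_active' is a hypothetical column added only for this check.
with_flag = df.assign(is_active=[True, False, True])
non_numeric_cols = list(with_flag.select_dtypes(exclude='number').columns)
print('is_active' in non_numeric_cols)
# True
```

If your data contains boolean or datetime columns, you may want to decide separately whether they belong with your categorical columns.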
How to get columns by type in Pandas
Before we sign off, we want to acknowledge that the select_dtypes method can do more than just separate categorical columns from continuous columns. It can also filter to columns of one specific type. For example, the method can be used to select only the continuous columns that hold integer values.
df.select_dtypes(include=int)
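A few more variations are sketched below on a trimmed-down copy of the toy DataFrame. The variable names are our own, and the printed results assume a recent Pandas version and the default dtypes Pandas infers for this data.

```python
import pandas as pd

df = pd.DataFrame({
    'float_1': [0.1, 0.2, 0.3],
    'int_1': [1, 2, 3],
    'string_1': ['one', 'two', 'three'],
})

# Select a single type.
int_cols = list(df.select_dtypes(include=int).columns)
float_cols = list(df.select_dtypes(include=float).columns)

# include also accepts a list of types.
mixed_cols = list(df.select_dtypes(include=[int, float]).columns)

# include and exclude can be combined: numeric columns that are not floats.
non_float_numeric = list(
    df.select_dtypes(include='number', exclude=float).columns
)

print(int_cols)           # ['int_1']
print(float_cols)         # ['float_1']
print(mixed_cols)         # ['float_1', 'int_1']
print(non_float_numeric)  # ['int_1']
```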
Source: https://crunchingthedata.com/continuous-categorical-columns-pandas/