Extracting Date Columns With any date Format

Rajvardhan Rawat
3 min readOct 4, 2020

Did you ever faced a situation when you are up-to something or your boss has thrown on you tons of data having so many unworthy variables or when you are simply tired of data engineering. In most of the real world datasets you may find date columns sometimes more than one and the date formats may vary from one organization/company to another.

Consider a data set having about 80 variables and 4 of them are dates which are useless for your ML model. Well don’t freak out because Here, I present you a way to deal with such a case. I will use a manually created data set having date columns in various formats.

The Data set somewhat looks like this —

Sample dataset

Conventional Way —

The tiring approach would be to create a function using re module in python and write down patterns for every possible date format. Or you may take help of NLP create a large data set and train it to identify a text as date or not and then use another trained network to convert the date format.

A single date format is necessary because there might be a case when you are not simply dropping the date columns but using them to do some date engineering. Lets say finding out a date is close to a holiday or not or taking difference of two dates from two date columns.

LexNLP -

LexNLP is a library for working with real, unstructured legal text, including contracts, plans, policies, procedures, and blah blah…Here I am going to use its very simple property to extract any date column from the dataset.

Here is a function for you. You can simply pass the data and get the results.

# Installing required package 
!pip install lexnlp
def DaysBtDates(data, impute = False):

# Converting data in any form into pandas dataframe
data = pd.DataFrame(data)

# Imputing with mode
if impute == True:
for i in range(data.shape[1]):
data.iloc[:,i] =data.iloc[:,i].fillna(data.iloc[:,i].mode()[0])

# Function to get column having dates in any possible format
def DateColumnExtractor(data):

# Creating empty list to store date column indexes
date_col_index = []
row_1 = data.iloc[1, :]

# Extracting Date columns using NLP module
for i,j in enumerate(row_1):

# Getting dates from each data point as string
text = str(j)
k = list(lexnlp.extract.en.dates.get_dates(text))

# lexnlp getdates method returns empty list if text is not a date
if len(k) != 0:
date_col_index.append(i)

date_columns = data.iloc[:, date_col_index]
return date_columns

The simple property that I am leveraging here is that getdates method of lexNLP returns an empty list If it doesn’t find any date columns. I am simply iterating over all the column values of the first row If its a date column its index number will be stored in date_columns container which is returned in the end.

Format Conversion —

  def FormatConversion():

date_columns = DateColumnExtractor(data)

date_formated = date_columns.copy()

for i in range(date_columns.shape[0]):
for j in range(date_columns.shape[1]):
text = str(date_columns.iloc[i,j])
k = list(lexnlp.extract.en.dates.get_dates(text))
date_formated.iloc[i, j] = k[0]
return date_columns,date_formated

This function converts any date format into YYYY/MM/DD format. Thus, Now date engineering becomes very easy.

I hope you will find this idea useful. This is my first blog on medium. I will try to improve myself in my upcoming blogs.

Thank You!

Please Find the whole Code in my github -

--

--