Extracting Date Columns With any date Format

Did you ever faced a situation when you are up-to something or your boss has thrown on you tons of data having so many unworthy variables or when you are simply tired of data engineering. In most of the real world datasets you may find date columns sometimes more than one and the date formats may vary from one organization/company to another.

Consider a data set having about 80 variables and 4 of them are dates which are useless for your ML model. Well don’t freak out because Here, I present you a way to deal with such a case. I will use a manually created data set having date columns in various formats.

The Data set somewhat looks like this —

Sample dataset

Conventional Way —

The tiring approach would be to create a function using re module in python and write down patterns for every possible date format. Or you may take help of NLP create a large data set and train it to identify a text as date or not and then use another trained network to convert the date format.

A single date format is necessary because there might be a case when you are not simply dropping the date columns but using them to do some date engineering. Lets say finding out a date is close to a holiday or not or taking difference of two dates from two date columns.

LexNLP -

LexNLP is a library for working with real, unstructured legal text, including contracts, plans, policies, procedures, and blah blah…Here I am going to use its very simple property to extract any date column from the dataset.

Here is a function for you. You can simply pass the data and get the results.

 
!pip install lexnlp
def DaysBtDates(data, impute = False):


data = pd.DataFrame(data)


if impute == True:
for i in range(data.shape[1]):
data.iloc[:,i] =data.iloc[:,i].fillna(data.iloc[:,i].mode()[0])


def DateColumnExtractor(data):


date_col_index = []
row_1 = data.iloc[1, :]


for i,j in enumerate(row_1):


text = str(j)
k = list(lexnlp.extract.en.dates.get_dates(text))


if len(k) != 0:
date_col_index.append(i)

date_columns = data.iloc[:, date_col_index]
return date_columns

The simple property that I am leveraging here is that getdates method of lexNLP returns an empty list If it doesn’t find any date columns. I am simply iterating over all the column values of the first row If its a date column its index number will be stored in date_columns container which is returned in the end.

Format Conversion —

  def FormatConversion():

date_columns = DateColumnExtractor(data)

date_formated = date_columns.copy()

for i in range(date_columns.shape[0]):
for j in range(date_columns.shape[1]):
text = str(date_columns.iloc[i,j])
k = list(lexnlp.extract.en.dates.get_dates(text))
date_formated.iloc[i, j] = k[0]
return date_columns,date_formated

This function converts any date format into format. Thus, Now date engineering becomes very easy.

I hope you will find this idea useful. This is my first blog on medium. I will try to improve myself in my upcoming blogs.

Thank You!

Please Find the whole Code in my github -

--

--

Petroleum Engineering, IIT(ISM) Dhanbad

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store