I have been working on a data analytics project for around 3 weeks. The project aims to visualize and query a database of employees based on their skills, industry, and specialty. It is a very interesting and challenging project: it sounds fairly simple, yet it is taking a surprising amount of time. This is not a bad thing, as I am taking the opportunity to verify a certain fact in data science. The data preparation came down to three steps:
- Combine all rows for an employee into one row
- Clean the data types
- Convert to JSON
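
To make these three steps concrete, here is a minimal sketch using Pandas. The column names and sample values are made up for illustration (the real dataset is not shown here), and the resulting `documents` list is only the kind of JSON-ready structure a document database expects, not the actual upload code.

```python
import json

import pandas as pd

# Hypothetical raw export: one row per (employee, skill) pair, everything as text.
raw = pd.DataFrame({
    "employee_id": ["1", "1", "2"],
    "name": ["Ada", "Ada", "Bob"],
    "region": ["EMEA", "EMEA", "APAC"],
    "skill": ["Python", "SQL", "Java"],
    "years_experience": ["5", "5", "n/a"],
})

# Clean the data types: IDs become integers, unparseable numbers become NaN.
raw["employee_id"] = raw["employee_id"].astype(int)
raw["years_experience"] = pd.to_numeric(raw["years_experience"], errors="coerce")

# Combine all rows for an employee into one row: collect the skills into a list
# and keep a single copy of the per-employee columns.
skills = raw.groupby("employee_id")["skill"].agg(list)
profile = (
    raw.drop(columns="skill")
    .drop_duplicates("employee_id")
    .set_index("employee_id")
)
combined = profile.assign(skills=skills).reset_index()

# Convert to JSON: one document (dict) per employee, with NaN written as null.
documents = json.loads(combined.to_json(orient="records"))
print(json.dumps(documents, indent=2))
```

Going through `to_json` and back with `json.loads` is a small trick to get plain Python types with missing values already turned into `None`, which is convenient for a JSON document store.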
It took me a week just to clean the data types, and this was just the first step in the project: uploading the data to a Cloudant NoSQL database. One might ask why I used JSON and NoSQL when I could have used a table format and a SQL database. There are two reasons: first, I am more comfortable working with NoSQL, and second, I was running an experiment.
Then came the challenge of querying the data once the app received a query specifying the requested combination of region, skills, industries, and specialty. Structuring the data correctly for querying took around 3 days; without the Pandas library, it might have taken a week or two. Funnily enough, building the structure of the web app, the login, and the user interface took only around 2 or 3 days in total.
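
As a rough illustration of that kind of query, here is a small Pandas sketch that filters employees by any combination of region, industry, specialty, and skills. The `find_employees` function, the column names, and the sample rows are all hypothetical; this shows the general idea, not the project's actual code.

```python
import pandas as pd

# Hypothetical prepared data: one row per employee, skills stored as a list.
employees = pd.DataFrame({
    "name": ["Ada", "Bob", "Cleo"],
    "region": ["EMEA", "APAC", "EMEA"],
    "industry": ["Finance", "Retail", "Finance"],
    "specialty": ["Analytics", "Backend", "Analytics"],
    "skills": [["Python", "SQL"], ["Java"], ["Python"]],
})


def find_employees(df, region=None, industry=None, specialty=None, skills=()):
    """Return the rows matching every filter that was actually provided."""
    mask = pd.Series(True, index=df.index)
    if region is not None:
        mask &= df["region"] == region
    if industry is not None:
        mask &= df["industry"] == industry
    if specialty is not None:
        mask &= df["specialty"] == specialty
    if skills:
        wanted = set(skills)
        # Keep employees whose skill list contains all requested skills.
        mask &= df["skills"].apply(wanted.issubset)
    return df[mask]


matches = find_employees(employees, region="EMEA", skills=["Python", "SQL"])
print(matches["name"].tolist())
```

Building one boolean mask and narrowing it filter by filter keeps each criterion optional, so the same function can serve whatever combination the user asks for.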
This little experiment of mine shows a very important fact about data science and analytics:
“80% of the time is spent cleaning the data”
I spent around 10 days cleaning and preparing the data, and just 4 days querying it and building the web app. Luckily, I was doing everything in Python, which provides a great set of tools and libraries for data science. My choice of database was not the best for this application, but real-life situations are rarely so neat: you almost always have to restructure, reformat, and reorganize the data.