Data scientist is the new job of the decade. Everyone is talking about becoming one or hiring one, but the big question is: do we need data scientists? To answer that, we must first understand what a data scientist does. As the name suggests, data scientists are scientists who work with data; they do everything from collecting and cleaning data to analyzing and modeling it. They are often knowledgeable in machine learning and can therefore develop machine learning models that make predictions from data.
So why would you need a data scientist? There are three major questions you can ask yourself:
- Do I deal with data?
- Do I have big data?
- Do I need to make sense of data?
If you answered yes to all three questions, then you should probably get a data scientist on board. The primary case for hiring a data scientist is the availability of large amounts of data, whether it be text, audio, video, or just numbers. With this data, a data scientist can help you extract trends and patterns, cluster and classify data into categories, and predict future trends.
“Data are becoming the new raw material of business.” — Craig Mundie
Craig Mundie of Microsoft put it nicely: “Data are becoming the new raw material of business.” This new paradigm opens the door to new business opportunities; done correctly, you can generate excellent stories out of data that can be offered as a business case. I would argue that the most important part of a data scientist’s job is actually telling these stories: explaining the findings in language that everyone can understand, and delivering value from those findings, is the key skill that distinguishes data scientists.
The main steps in any data science project are outlined in the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which should be understood by everyone in the company, not just the data scientist. Essentially, CRISP-DM guides the whole data science project, from before data ingestion through deployment. The steps are as follows:
- Understand the need for insights on data, and the business case which the project will support.
- Understand the data available, and update the business case accordingly.
- Prepare the data by removing redundancy, fixing missing data, and formatting the data.
- Model the data.
- Evaluate the model, and based on that, complete the business case.
- Finally, deploy the model to use on future data.
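To make the data preparation step in the list above more concrete, here is a minimal sketch in plain Python. The records, the `prepare` function, and the choice to fill missing values with the mean are all assumptions made for illustration; CRISP-DM does not prescribe a technique, and a real project would pick one that fits its data.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, monthly_spend as text).
RAW_ROWS = [
    ("c1", " 120.50 "),
    ("c2", ""),          # missing value
    ("c1", " 120.50 "),  # exact duplicate of the first row
    ("c3", "80"),
    ("c4", "99.5"),
]


def prepare(rows):
    """Remove duplicate rows, format spend as floats, and fill missing values."""
    # Remove redundancy: keep the first occurrence of each row, in order.
    unique_rows = list(dict.fromkeys(rows))

    # Format the data: strip whitespace and convert to float; blanks become None.
    parsed = []
    for customer_id, spend_text in unique_rows:
        text = spend_text.strip()
        parsed.append((customer_id, float(text) if text else None))

    # Fix missing data: fill with the mean of the observed values
    # (an assumed strategy, chosen only to keep the example short).
    observed = [spend for _, spend in parsed if spend is not None]
    fill_value = mean(observed)
    return [
        (customer_id, spend if spend is not None else fill_value)
        for customer_id, spend in parsed
    ]


clean_rows = prepare(RAW_ROWS)
for row in clean_rows:
    print(row)
```

Even in this toy case, each cleanup rule is a judgment call (what counts as a duplicate, how to fill a gap), which is part of why preparation tends to dominate the project timeline.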
Up to 80% of the time is spent on cleaning the data.
The process is rather time-consuming, particularly data preparation, which can take up to 80% of the project’s time. Nonetheless, the outcomes of data science projects are of tremendous value, since you gain insights that were often unseen before. This is the new business weapon that can turn the tables on competitors.