Creating a Data Scientist Team from Scratch

Dr. Farruh
5 min read · Oct 23, 2020

Intro: what a data scientist is, and why 93% of companies need a data scientist
In the 21st century, IT development is critical for governments and companies, and AI implementation is even more significant. It requires data scientists with solid backgrounds and skill sets. However, a shortage of specialists complicates AI implementation, especially in countries where AI is still new and universities do not yet offer AI-related courses. Luckily, things get easier if a company builds its data scientist team from scratch and educates potential data scientists along the way.

Data science and data analytics are growing so rapidly that there is a shortage of qualified applicants for the number of jobs available. Since it is a challenge for a company to find suitable experts in this field, it is better to create a data scientist team from scratch than to bring in outside experts, which means paying relocation fees and high HR costs.

As a data scientist, I am happy and excited to introduce Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) to more people. I built my first data scientist team from zero in May 2018 for a Swiss company. The purpose of the team was to start an AI project that would benefit the company. The idea came first; teaching the candidates AI/ML/DL came afterward. We decided to pick candidates who showed talent in these areas, even if they had never had the chance to study AI systematically.

Basic Knowledge for a Data Scientist
In general, a data scientist needs software engineering skills, a statistician's knowledge, and a healthy dose of experience in the industry in which they want to work. Roughly 90% of data scientists have a university education, up to and including PhDs, but the fields in which they earn their degrees vary widely.

The most important technical skill for a data scientist is statistical analysis, combined with the know-how to use computing frameworks to mine, process, and present the value in large volumes of unstructured data.

This means that you need to be skilled in maths, programming, and statistics. One way of meeting this prerequisite is to have a relevant academic background.

Data scientists usually have a Ph.D. or Master's in statistics, computer science, or engineering. These learning experiences give them a strong foundation to connect with the technical points that form the core of data science practice.

Some universities now offer specialized programs tailored to the educational requirements for pursuing a career in data science.

For those who prefer not to commit to a focused but extensive tailor-made program, there are Massive Open Online Courses (MOOCs) and boot camps. Options worth exploring include Simplilearn’s Big Data & Analytics certification courses, which can deepen learners’ understanding of the core subjects behind a data scientist’s practice while also providing hands-on learning that cannot be found within the confines of a textbook.

My First Meeting with the Team
After a long period of preparation and discussion, the first meeting was held in November 2018. We set out to find as many candidates as possible and selected the ten best among them. Meanwhile, we gave relevant lectures and opened social network channels to attract more candidates. I gave a lecture titled “The World of Artificial Intelligence” at Inha University (www.inha.uz) (https://inha.uz/en/news/697/), which motivated young students to join the AI field.

Start the Education Project
After the ten candidates were selected to learn AI/ML/DL, we introduced Andrew Ng's well-known free Machine Learning course (https://www.coursera.org/learn/machine-learning) as their entry-level study material.

The distance between Shanghai, Tashkent, and Switzerland and the different time zones created a few challenges at the beginning, but as the project continued, we found ways to overcome them. After a discussion with the ten candidates, we spent four weeks intensively studying the Andrew Ng course. After reviewing their progress, we extended the study period by another four weeks, so it took us two months in total to complete the Andrew Ng Machine Learning course. In addition, we gave the candidates various mathematical tasks related to ML, as well as simulation tasks. At the end of the course, the candidates took exams. Two of them passed on the first attempt; seven did not pass at first but passed the second time. Building on this success, we continued the AI courses in Tashkent and opened a second round to teach new candidates an updated program, improved on the basis of the first experience.

The Initial Project
The project implemented face recognition inside the company to recognize customers and employees. Before starting the project, as the team lead, I needed to know the new data scientists’ capabilities in data cleaning, feature engineering, and optimization algorithms. Therefore, I gave them simple modeling tasks and checked their KPIs afterward. Once I had identified their strengths, I gave them tasks related to face recognition.

Our old data scientist team found open data sources to train the model. Why did we seek open data sources? Usually, creating a dataset from zero is the most expensive part of AI. Hence, we found the following dataset:

Then we started on data cleaning, feature engineering, mathematical modeling, and the other steps. Our data scientist team implemented a UI for face recognition and deployed it on the server side. Once the team members were well trained through the initial project, it was time to move on to the real project.

The Container Number Recognition (CNR)
CNR reads and identifies ISO 6346 container codes at logistics ports and on handling cranes. The system lets us manage several lanes from a single post, control access, and efficiently recognize both the containers and the trucks that transport them. CNR is based on deep learning and optical character recognition.
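An ISO 6346 code carries its own check digit, which makes it possible to sanity-check what an OCR stage has read. This article does not describe how our CNR system handles this step, so the snippet below is only a minimal sketch of the standard check-digit calculation. The function names (`check_digit`, `is_valid_container_code`) are hypothetical and chosen for illustration.

```python
import re
import string


def _build_letter_values():
    """Map A-Z to ISO 6346 values: start at 10 and skip multiples of 11."""
    values = {}
    value = 10
    for letter in string.ascii_uppercase:
        if value % 11 == 0:  # 11, 22 and 33 are skipped
            value += 1
        values[letter] = value
        value += 1
    return values


LETTER_VALUES = _build_letter_values()

# 3-letter owner code, category identifier (U, J or Z),
# 6-digit serial number, 1 check digit.
CODE_PATTERN = re.compile(r"[A-Z]{3}[UJZ][0-9]{7}")


def check_digit(first_ten: str) -> int:
    """Compute the check digit from the first ten characters (uppercase)."""
    total = 0
    for position, char in enumerate(first_ten):
        value = LETTER_VALUES[char] if char.isalpha() else int(char)
        total += value * (2 ** position)  # weight doubles at each position
    return total % 11 % 10  # a remainder of 10 maps to 0


def is_valid_container_code(text: str) -> bool:
    """Return True if text is a well-formed code with a matching check digit."""
    code = re.sub(r"[\s-]", "", text).upper()  # tolerate spaces and hyphens
    if not CODE_PATTERN.fullmatch(code):
        return False
    return check_digit(code[:10]) == int(code[10])


print(is_valid_container_code("CSQU3054383"))  # True
print(is_valid_container_code("CSQU3054384"))  # False
```

A check like this could be used to reject or re-read a frame when the recognized characters do not agree with the check digit, without needing any ground-truth label.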

This project is still ongoing.

Conclusion
Within three months, we created the data scientist team from scratch. After two more months of practice with open data resources, the team was ready to start working on a real project, five months in total.

Feeling excited? Roll up your sleeves now, and you will make it too!

Written by Dr. Farruh

A self‐sufficient Solution Architect with experience in executing data‐driven solutions to increase efficiency and accuracy. Always passionate about high-tech