Why Becoming a Data Scientist is NOT Actually Easier Than You Think
Day in and day out, you probably come across articles claiming that it is easy to become a data scientist. Most of them say that anyone who takes a machine learning course on Coursera will qualify as a data scientist. This is not the case. Becoming a data scientist requires a combination of skills, not just familiarity with basic machine learning algorithms. Companies that employ qualified data scientists benefit, and ultimately profit, because those experts help them make objective decisions based on data analysis. Data science experts should not only specialize in machine learning; they should also have a comprehensive knowledge of coding algorithms, database analysis, and data interpretation. While you can take a machine learning course on Coursera, there are things a data scientist needs to know that you cannot learn there.
Programming languages plus other technologies
Most companies that employ data scientists use programming languages like Java, Python, Scala, and Ruby in their back-end web services. Few of them use Matlab. The languages they do use, however, are not covered in the Coursera course. Python, for instance, has numerous libraries such as NumPy, scikit-learn, and SciPy, all of which are helpful for solving numerical problems. The same goes for Java, which has a library that is good for statistics. None of this is covered in the machine learning course. This becomes a big problem when your employer asks you to integrate an algorithm into a web service, something you know nothing about.
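To make the web service point concrete, here is a minimal sketch of what "integrating an algorithm into a web service" can look like, using only Python's standard library. The model is an assumption made for illustration: `WEIGHTS`, `BIAS`, `predict`, `handle_payload`, `PredictHandler`, and `make_server` are hypothetical names, and the fixed weights stand in for whatever a real training step would produce.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model: fixed weights standing in for a trained linear classifier.
WEIGHTS = [0.8, -0.4]
BIAS = 0.1


def predict(features):
    """Return 1 if the linear score is positive, else 0."""
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1 if score > 0 else 0


def handle_payload(body):
    """Turn a JSON request body into a JSON response body (both as bytes)."""
    features = json.loads(body)["features"]
    return json.dumps({"label": predict(features)}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the model's prediction."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def make_server(port=8000):
    """Build (but do not start) the server; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` would start the service locally, after which a POST with a body such as `{"features": [1.0, 0.0]}` returns a JSON label. A production service would also need input validation, error handling, and a proper deployment setup, none of which is shown here.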
Big data software
Most of the problems that data scientists work on cannot be run on a machine with, for example, 500 MB of RAM. They usually involve large data sets, which sometimes require distributed data processing. A simple machine learning course on Coursera, whose data sets are usually tiny, therefore cannot make you a qualified data scientist. A qualified data scientist must understand MapReduce and distributed file systems, and be able to use Hadoop effectively. You might have some knowledge of Java, but if you know nothing about Hadoop Streaming (which also does not exist in Matlab), you still cannot call yourself a data scientist.
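For readers who have not seen MapReduce, the sketch below shows the idea on a word count, written in the style Hadoop Streaming expects: a mapper and a reducer that exchange tab-separated key/value records. It is an illustration under assumptions, not a Hadoop job. The function names `mapper`, `reducer`, and `run_local` are hypothetical, and `run_local` only simulates the shuffle-and-sort step in a single process instead of across a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Simulate the map, shuffle/sort, and reduce phases in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["big data", "big clusters"]):
        print(record)
```

On a real cluster the mapper and reducer would be separate scripts reading from standard input and writing to standard output, and the framework would handle the sorting and the distribution of records between machines.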
You can take algorithms from a machine learning course and build several different classifiers to tackle a real-world problem, but if your features are poor, your classifiers will almost certainly perform poorly. To extract good features, you must thoroughly understand the problem at hand, the underlying data distributions, and, more importantly, how the data was produced. Feature extraction is not covered in the Coursera course, but it is knowledge that every data scientist should possess.
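A small, made-up example can show why features matter more than the choice of classifier. In the sketch below, the toy points, their labels, and the helper `best_threshold_accuracy` are all assumptions invented for illustration. The same trivial one-threshold classifier is applied twice: once to a raw coordinate, and once to a feature (distance from the origin) that reflects how the data was produced.

```python
import math

# Hypothetical toy data: (x, y) points labeled 1 if they lie inside a
# circle of radius 1 around the origin, and 0 otherwise.
points = [(0.1, 0.2), (-0.3, 0.4), (0.5, -0.5), (-0.6, -0.2),
          (1.5, 0.1), (-1.2, 1.0), (0.2, -1.8), (-1.4, -1.1)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def best_threshold_accuracy(values, labels):
    """Accuracy of the best single-threshold rule on one feature.

    Tries every observed value as a threshold and allows the rule to
    point in either direction (<= t means class 1, or <= t means class 0).
    """
    best = 0.0
    for t in values:
        predicted = [1 if v <= t else 0 for v in values]
        acc = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
        best = max(best, acc, 1 - acc)
    return best


# Raw feature: the x coordinate alone.
raw_x = [x for x, _ in points]
# Engineered feature: distance from the origin, which matches how the
# labels were generated.
radius = [math.hypot(x, y) for x, y in points]

raw_acc = best_threshold_accuracy(raw_x, labels)
radius_acc = best_threshold_accuracy(radius, labels)

print(f"raw x accuracy:  {raw_acc:.2f}")
print(f"radius accuracy: {radius_acc:.2f}")
```

On this toy data the threshold on the radius separates the two classes perfectly, while no threshold on the raw x coordinate does. The classifier is identical in both runs; only the feature changed, and choosing it required knowing how the labels came about.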
Becoming a good data scientist may take years of experience. Data science is much more than understanding machine learning algorithms and how they function. It extends to knowing which questions to ask and how your answers will be communicated to investors, management, and all other concerned stakeholders. This is something you cannot learn in a simple online machine learning course.