People have called it the sexiest job of the 21st century, a hot and growing field that will need millions or billions of resources in the future. But what is it? I found it confusing at first, because it is hard to separate the substance of science from the methodology of solving scientific problems through data computation. From the beginning, the purpose of computing has been insight, not data. Thus computing is, or at least should be, intimately bound up with both the source of the scientific problem and the model that is going to be made of the answers; it is not a step to be taken in isolation from physical reality. As a “failed theoretical physicist”, of course, I am very biased.
The widely accepted Venn diagram model (many books refer to it) defines data science as the intersection of hacking skills, math and statistics knowledge, and substantive expertise. Although I really want to argue with it, I quickly realized that “substantive expertise” is open to any scientific field; hence I would only waste my time arguing over open ground. Even after consulting Wikipedia, which defines data science as the extraction of knowledge from large volumes of unstructured data, I am still deeply confused. Nevertheless, let it be my problem, not yours. It is well known that the IT industry has a “unique” habit of giving confusing names to the same thing.
Assuming I can push myself to accept Wikipedia's definition of data science (never, in reality), how does it relate to science? In science, there is a set of rules (the fundamental laws of nature) in operation, and the task of scientists is to figure out what the rules are by observing the results (data) that occur when the rules are followed. Simply put, it is an attempt to reverse-engineer, or “hack”, the machinery of nature. In math it is the other way around: you choose the rules (or model) and discover what follows from choosing any particular set of them. There is a superficial similarity between the two, which leads to my other confusion.
In science, the way we test a theory is to codify it as a set of models and then explore the consequences of those models; in effect, to predict what would happen if those models were true. People do the same thing in math, and in fact the way it is done in math sometimes serves as a model for the way it is done in science. But the big difference is this: in science, as soon as our predictions conflict with experimental data from nature, we are done. We know that our models are wrong and need to be modified. In math, this kind of conflict is minimal, because there is no necessary connection between any theory and the world. As long as a theory is still interesting enough to induce mathematicians to keep working on it, it will continue to be explored.
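To make the scientific side of that difference concrete, here is a minimal Python sketch of the "predict, compare, reject" loop. Everything in it is an assumption for illustration only: the measurements are invented, the two candidate models and the `survives` helper are hypothetical names, and the tolerance of 0.3 m is arbitrary.

```python
# Hypothetical measurements of a falling object: (time in s, distance in m).
# These numbers are invented for illustration, not real experimental data.
observations = [(0.5, 1.2), (1.0, 4.9), (1.5, 11.1), (2.0, 19.5)]


def constant_speed_model(t):
    """Candidate theory A (assumed): distance grows linearly with time."""
    return 4.9 * t


def constant_acceleration_model(t):
    """Candidate theory B (assumed): d = g * t^2 / 2 with g = 9.81 m/s^2."""
    return 0.5 * 9.81 * t ** 2


def survives(model, data, tolerance=0.3):
    """Return True if every prediction lies within `tolerance` of the data."""
    return all(abs(model(t) - d) <= tolerance for t, d in data)


# Confront each codified theory with the same observations.
results = {
    "constant speed": survives(constant_speed_model, observations),
    "constant acceleration": survives(constant_acceleration_model, observations),
}

for name, ok in results.items():
    print(name, "-> consistent so far" if ok else "-> conflicts with data")
```

Note that a model that passes is only "consistent so far"; a single conflicting measurement is enough to send it back for modification, which is exactly the constraint a purely mathematical theory does not have.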
Data science, as we know it so far in the IT industry, refers to a collection of tools and methods for getting insights from data (not necessarily large or big) by analyzing it with various computational techniques and then communicating (or consuming) the insights through visualization (or other means). It typically deals with data that is mostly unstructured, collected from users, computer systems, or other sources such as sensors, without a single predefined format. There are long debates in online forums about its definition, and as it is still hyped, it will take more time before it finally lands back on earth. That may be a legacy of computer science, whose own definition has also been debated for decades.
People who come from a statistics or math background will argue that data science is mostly about statistical analysis of data using modern tools, languages, libraries, and computing infrastructure. By hacking those technologies, they can produce insights from data with statistical methods. If, on the other hand, their background is physics, for instance, they will think of numerical methods or computer simulations to fit a modeled hypothesis to experimental data. Computer scientists who have explored areas such as information retrieval will proudly claim that machine learning finally has a better name. Then there are the ex-scientists who are good at programming, and the programmers who are good at statistics and scientific or numerical computing. All of these views may be true subjectively, but if you look at reality, the variety of languages, tools, methods, and techniques for data analytics leads to an art rather than a science. Yes, “data analytics art”, if you need a new name again (is “data artist” perhaps a better title?). But no, it is not attractive enough, and the name was already taken by digital artists. Disclaimer: in the IT business, we are in high demand of new hype and jargon (read the Gartner Hype Cycle 2014). So let's stick with data science, as it is the normally accepted name for a growing trend.
What do data scientists actually do? Does it cover collecting and pre-processing the data, formulating hypotheses, identifying the algorithms and tools that fit, performing the computation, communicating insights, and creating abstractions for higher-level business people? Yes, perhaps all of that is written in their resumes, a mix of software/data engineering and data analysis tasks. As the field is still far from mature, roles and responsibilities may change over time (I believe they will become business roles, not only IT ones), new data sources will explode along with other hypes (such as the Internet of Almost Stupid Things), and companies crafting automation tools, frameworks, and platforms will emerge and raise more funds to innovate faster. More and more things can happen, as art has no end. The art of machine intelligence is still in progress. If we find a way to unsupervised machine intelligence, many other things can happen, including that we may no longer need data scientists and can let the machines work for us. We all need to respond (or just do nothing) to this new hyped trend. I choose to enjoy the show by hacking it!