Good Software Engineer?

It usually takes some time for me to answer such a question, or I just give my best two cents, because the context is not well defined. We hear a lot of myths about being a good software engineer, and we are mostly more familiar with the other side of the equation – the bad habits of software engineers, such as the lazy pseudo-hacker, dictator, brute-force superman, careless tai-chi master, overcautious, Google-bot, documentation-hater, not-a-tester, dirty coder, ninja-turtle, or, to be more business-centric, the short-term investor. You don’t need to consult the experts on Quora to get to know their bad behaviors, as you are obviously familiar with them.

Let me make my own definition of the question. If we care about bad, good or great, or any level of judgment, it means we are talking about an employment role as a software engineer inside a company, where people have to work with others. My opinion in this post will not work for free-minded individuals who do it only for fun to achieve a masterpiece; they don’t have KPIs or job evaluations, by the way.

First of all, most long answers on Quora to “How to Become a Good Software Engineer” are dominated by Googlers (fewer Microsofties nowadays, but that may change later) who proudly explain the importance of the fundamental or analytical part of the role, like math, compilers, algorithms and data structures, crypto, parallel and distributed algorithms, artificial intelligence, or practical skills like coding, programming languages, testing, debugging, frameworks, tools, domain expertise, SDLC etc. It looks like you need a Master’s degree or maybe a PhD from a top CS school before taking a role at Google (or later an MBA to climb the ladder once you get in). Oops – as I mentioned AI – that requires numerical/scientific computing, statistics, probability, and many more. I’m too lazy to write them all, but to give it a name I pick hard skills. (I am free to give any name to anything, as this is my blog.)

OK – before you think you are not fit to become a good software engineer because of the paragraph above, you may want to look at the Life and Times of Anders Hejlsberg sometime, not only at Martin Odersky, Ken Thompson or Terence Tao, who have proper genius-level academic backgrounds. We can find many other stories like that, even though I’m not really recommending that path for my kids. It may help to know all of those academic hard skills, but there is no guarantee at all.

In any profession that requires people to work with other people, not only in engineering, it takes time to become good. Why? The problem domain is much bigger. Software engineering is no longer the craftsmanship of a single person writing code; its physics becomes the physics of people and of the business. If I list the competencies needed to deal with people and business, you’ll get a longer list for sure. Just to name a few: communication, writing, presentation, listening, business awareness, social awareness, time management, daily discipline, product planning, estimation, and the scariest one – leadership. There is no fundamental law for dealing with people and business so far. All of us have to experience, read, learn, think, discuss and practice a lot to get better at it. Let’s give it a name: soft skills.

Assuming you agree with my naming convention, hard and soft skills are not something people can build in a short period of time. To be a good software engineer, you have to balance both and gradually become good at both. Yes, it takes time, but if you really want to do it, it is not impossible. How long? Most young fresh graduates will NOT like my consistent statistical answer: 10 years to be good at something. Lucky is the one who started early for whatever reason, for example because he came to love math and code in high school after his father bought him a computer. I use love here to mean a more-than-casual interest, so don’t be confused. You can pick any other profession – I believe even Clown Balls need 10 years to be good at balancing practical and entertaining skills.

People who do not accept the statistical rule of 10 years usually choose to underestimate the process or, even worse, the contents of learning. They will say that fundamental/analytical knowledge and skills are perfectionism, and they are biased toward giving more weight to practical/pragmatic skills. In fact, balance is the key and it takes time. Hard vs Soft, Perfectionism vs Pragmatism, Sprint vs Marathon, Business vs Technical: all need to be balanced.

Being humble and bold enough to accept the 10-year rule will give you a steady state in which to focus on building competencies, no matter where you started – from high school math level or top CS school level. People who started earlier will secure more time, but if you did not, don’t worry. The average working period is now around 30 years, so you can still spend your first 10 years becoming a good software engineer: push yourself hard (usually it is only hard at the beginning), and stay passionate about it. The problem comes when you do not accept the rule and live full of biases, ending up wasting time and never being good at anything. Assuming you do accept it, you can read all of the Googlers’ formulas on Quora and start building your hard and soft skills with deep humility. Enjoy the process and find the beauty of the small things you do. As Feynman said – There’s plenty of room at the bottom, where you can find the beauty of small things.

Once you have decided, you have to pay the price. Software engineering work requires you to have the endurance to focus (without distractions) on specific analytical, craftsmanship, people and business-related problems. Your System 2 has to work 4-8 hours a day consistently, and your System 1 perhaps the same or less. System 2 is the part of your brain that is slower, more deliberative, effortful, infrequent, logical, calculating and conscious. System 1 is fast, automatic, frequent, emotional, stereotypic and subconscious. Yes, it is hard for average people to balance System 1 and System 2.

Look back. You started with math in school, then learned the other fundamentals gradually. Be good at algorithms and data structures first, try to translate computational concepts into programming languages, learn more than one language (but be really good at one first), get a job at a company with a good culture, write lots of code, invest time in thinking before you code, read other people’s code, read papers to help you solve complex technical problems, care about the quality of the code you write, communicate, document, test, and do all of that engineering, people and business stuff. Yes, never forget the balance between hard and soft skills. With the availability of MOOCs now, like Coursera, Udacity and edX, you can learn from the best, even for free. You can have the best professors teach you the things you want to learn. Enjoy the journey, as it is really worth pursuing nowadays. Don’t get distracted by the startup dream if you are not really ready (be honest in your assessment). If you think you have the basics and have already decided on your commitment to the work, then you can send me your CV.

Hope this helps!

A Quick Intro to Spark

Fast computers and cheaper memory have stimulated the rapid growth of a new way of doing data computation. During this time, parallel computation infrastructures have evolved from lab experiments into everyday tools for data scientists who need to analyze data and get insights from it. However, the barriers to the widespread use of parallelism still lie in at least one of the three large subdivisions of computing: hardware, algorithms and software.

Imagine the old days when we had to deal with all three at the same time: high-speed interconnect network switches, parallelizing sequential algorithms, and various software stacks of compilers, libraries, frameworks and middleware. Many parallelism models have been introduced over the decades, like data partitioning in old-day FORTRAN or on SIMD machines, shared-memory parallelism and message passing (remember C/C++ MS-MPI clusters in the old days). I am part of the generation that faced the “Dark Age” of distributed numerical computing. It is much better now, I hope.

Apache Spark is a fast and general-purpose cluster computing system. Spark promises to make our life easier by letting us write distributed programs like other normal programs, abstracting away the “nitty-gritty” details of distributed systems that I had to handle in my previous experience with message passing (MPI). I know it is too early to predict the success of Spark, but I’m biased by my previous distributed systems experience and would like to continue it in this post ☺.

We all need speed in data computation. Imagine that your forecasting analytics on large business datasets takes a day to complete while your business people expect results in hours or minutes – nowcasting vs forecasting. On the speed side, Spark extends the MapReduce model to support more types of computation, like batch, iterative/recursive algorithms, interactive queries and micro-batch stream processing. Spark makes it easy and inexpensive (as the price of CPUs and GPUs gets cheaper) to run those processing types and reduces the burden of maintaining infrastructure, tools and frameworks. Spark is designed to be friendly to developers: it offers language bindings for Python, Java, Scala, R (via SparkR) and SQL (Spark SQL), and of course ships with ready-to-use libraries such as GraphX and MLlib. Support from deep machine learning practitioners is also growing, with projects like H2O Sparkling Water, DL4J and Prediction.IO. It also integrates closely with other big data tools, like Hadoop, YARN, HBase, Cassandra, Mesos, Hive, etc. The Spark ecosystem is growing very fast.

Spark started in 2009 as a research project in the UC Berkeley RAD Lab (later AMPLab). The researchers there, who had previously worked with Hadoop MapReduce, found that MapReduce was inefficient for iterative and interactive computing jobs. You can refer to the research papers for better scientific proof, or follow the thriving OSS developer community around Spark, including famous startups like Databricks.

Let me share my hacking experience with Apache Spark. Spark is written in Scala and requires a JVM to run. If you want to work with Python later, you may need to install a Python distribution like Anaconda, which bundles the frameworks you need for scientific computing, including the famous Jupyter Notebook. I started by downloading the Spark binary and later the source code to build on my Mac. A straightforward Maven-based compilation took some time (~24 minutes) before I could run the Spark shell. But I was impatient, so during the compilation I just downloaded and used the binary version (now version 1.4.0) to test some commands. The good news was that I could use Spark without Hadoop, even on my single Mac, to practice its basic principles. When Spark was ready on my machine, I just followed the README.md file (a good habit of a geek) to test it, for example in the Scala shell:

./bin/spark-shell
scala> sc.parallelize(1 to 1000).count()

or in the Python shell:

./bin/pyspark
>>> sc.parallelize(range(1000)).count()

Spark comes with several sample programs in the `examples` directory. To run one of them, I used `./bin/run-example <class> [params]`. For example, for SparkPi:

./bin/run-example SparkPi

The first thing I learnt about Spark was how to write a custom driver program that launches various parallel operations on my single-machine Spark instance. The driver program (which we can write in Python, Scala, Java, or R) contains the main function and defines distributed datasets on the cluster, then applies operations to them. The Spark shells are obvious examples of driver programs; they access Spark through a SparkContext object, which represents a connection to a Spark computing cluster. In either shell, a SparkContext is predefined for us as the sc object, as in the examples above. The default Spark distribution (now version 1.4.0) provides spark-shell (for Scala) and pyspark (for Python) for interactive computing with the sc object.

The second thing I learnt was Spark’s main abstraction for working with distributed data, the RDD (Resilient Distributed Dataset), a distributed immutable collection of objects. In a clustered environment, each RDD is split into multiple partitions, which may be computed on different nodes. Programming in Spark is expressed as creating new RDDs from data sources, transforming existing RDDs, or performing actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations we want to perform on them.
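To make the idea of partitions a bit more concrete, here is a minimal plain-Python sketch. It is not Spark code and not how Spark is implemented; it only mimics the concept behind a call like sc.parallelize(1 to 1000).count(): split a collection into chunks, count each chunk independently, then combine the partial results. The helper names split_into_partitions and parallel_count are hypothetical, and a local thread pool merely stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def parallel_count(data, num_partitions=4):
    """Count each partition in a worker thread, then sum the partial counts."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(len, partitions))
    return sum(partial_counts)


if __name__ == "__main__":
    # Same data as the shell example: the numbers 1 to 1000.
    print(parallel_count(list(range(1, 1001))))
```

In a real cluster the partitions live on different machines and Spark takes care of scheduling and combining the results; the sketch only shows why a count can be computed piece by piece.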

RDDs can contain any type of Python, Java, Scala, or R (through SparkR) objects, including user-defined classes. Users can create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program. Once created, RDDs offer two types of operations: transformations and actions. Transformations construct a new RDD from a previous one; RDDs provide many methods for this, such as filter() in the examples below. Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or save it to an external storage system like HDFS, HBase, Cassandra, ElasticSearch etc.

For example, Python filtering of the README.md file:

>>> lines = sc.textFile("README.md")
>>> pythonLines = lines.filter(lambda line: "Python" in line)
>>> pythonLines.first()

And the Scala version of the same filtering on the same file:

scala> val lines = sc.textFile("README.md")
scala> val pythonLines = lines.filter(line => line.contains("Python"))
scala> pythonLines.first()

Finally, the third thing I learnt was Spark’s lazy style of execution. Although we can define new RDDs at any time, Spark computes them only lazily, that is, the first time they are used in an action. By default, Spark’s RDDs are recomputed each time we run an action on them. To reuse an RDD in multiple actions, we can ask Spark to persist it using RDD.persist(), which can store the data in a number of different places. After computing it the first time, Spark will store the RDD contents in memory (partitioned across the machines in the cluster) and reuse them in future actions. Persisting RDDs on disk instead of in memory is also possible. The behavior of not persisting by default may seem unusual, but it makes a lot of sense for big datasets: if you will not reuse the RDD, there’s no reason to waste storage space when Spark could instead stream through the data once and just compute the result. In practice, we will often use persist() to load a subset of data into memory and query it repeatedly.

Example of persisting the previous RDD in memory:

>>> pythonLines.persist()
>>> pythonLines.count()
>>> pythonLines.first()
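If the difference between recomputing and persisting feels abstract, the following toy model in plain Python may help. It is only an illustration under my own assumptions, not Spark's implementation: the class LazyDataset and its compute_count attribute are hypothetical names, transformations just record work to do, and only actions (count, first) run the pipeline. The counter shows how often the pipeline really ran.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, computes on actions."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that returns the raw items
        self._steps = tuple(steps)   # pending (kind, function) transformations
        self._persist = False
        self._cache = None
        self.compute_count = 0       # how many times the pipeline actually ran

    def filter(self, predicate):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def persist(self):
        # Mark the dataset so the first computed result is kept in memory.
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache
        self.compute_count += 1
        items = list(self._source())
        for kind, func in self._steps:
            if kind == "filter":
                items = [x for x in items if func(x)]
            else:
                items = [func(x) for x in items]
        if self._persist:
            self._cache = items
        return items

    def count(self):
        # Action: triggers the computation.
        return len(self._compute())

    def first(self):
        # Action: triggers the computation.
        return self._compute()[0]


def demo():
    source = lambda: ["Spark and Python", "Scala only", "Python again"]

    python_lines = LazyDataset(source).filter(lambda line: "Python" in line)
    defined_only = python_lines.compute_count     # nothing computed yet
    python_lines.count()
    python_lines.first()
    without_persist = python_lines.compute_count  # one run per action

    cached = LazyDataset(source).filter(lambda line: "Python" in line).persist()
    cached.count()
    cached.first()
    with_persist = cached.compute_count           # computed once, then reused
    return defined_only, without_persist, with_persist


if __name__ == "__main__":
    print(demo())
```

Real Spark also offers several storage levels and spreads the persisted data across the cluster, which this single-process sketch leaves out on purpose.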

As this is just a quick intro to Spark, there is a lot more to hack if you are curious. To learn more, read the official Spark programming guide. If you prefer the MOOC style, I recommend edX BerkeleyX CS100.1x, Introduction to Big Data with Apache Spark, from Databricks. Books can also help your learning curve; you can try these:

  1. Learning Spark: Lightning-Fast Big Data Analysis
  2. Advanced Analytics with Spark: Patterns for Learning from Data at Scale
  3. Machine Learning with Spark

Lastly, hacking on specific computation problems is always a better way to learn. Good luck with your Spark hacking!

Data Science – Science or Art?

People call it the sexiest job of the 21st century, a hot and growing field that will need millions or billions of resources in the future. But what is it? I found it confusing at the beginning, as it is ambiguous where to draw the line between the substance of science and the methodologies for solving scientific problems through data computation. From the beginning, the purpose of computing has been insight, not the data. Thus computing is, or at least should be, intimately bound up with both the source of scientific problems and the model that is going to be made of the answers; it is not a step to be taken in isolation from physical reality. As a “failed theoretical physicist”, of course I am very biased.

The widely accepted Venn diagram model (many books refer to it) defines data science as the intersection of hacking skills, math and stats knowledge, and substantive expertise. Although I really want to argue with it, I quickly realized that “substantive expertise” is open to any area of science; hence I would again waste my time arguing in an open area. Even after consulting Wikipedia, which defines data science as the extraction of knowledge from large volumes of data that aren’t structured, I’m still deeply confused. Nevertheless, let it be my problem, not yours. It is well known that the IT industry has a “unique” habit of giving confusing names to the same thing.

Assuming I can push myself to accept the data science definition from Wikipedia (never in reality), how can I relate it to science? In science, there is a set of rules (the fundamental laws of nature) in operation, and the task of scientists is to figure out what the rules are by observing the results (data) that occur when the rules are followed. Simply said, it is an attempt to “reverse-engineer” the machinery of nature. In math, it’s the other way around: you choose the rules (or model) and discover the insights that follow from choosing any particular set of models. There is a superficial similarity, which leads to my other confusion.

In science, the way we test a theory is to codify it as a set of models and then explore the consequences of those models – in effect, to predict what would happen if those models were true. People do the same thing in math, and in fact the way it’s done in math sometimes serves as a model for the way it’s done in science. But the big difference is this: in science, as soon as our predictions conflict with experimental data from nature, we are done. We know that our models are wrong and need to modify them. In math, this kind of conflict is minimal, because there is no necessary connection between any theory and the world. As long as a theory is still interesting enough to induce mathematicians to keep working on it, it will continue to be explored.

Data science, as far as we know it in the IT industry, refers to a collection of tools and methods for getting insights from data (not necessarily large or big) by analyzing it with various computational techniques and later communicating (or consuming) the insights through visualization (or something else). It typically deals with data that is mostly unstructured, collected from users, computer systems or other sources like sensors, without a single predefined format. There are long debates in online forums regarding its definition, and as it is still hyped, it will take more time until it finally lands on earth again. This may be a legacy of computer science, whose definition has also been debated for decades.

People who come from a statistics or math background will argue that data science is mostly about statistical analysis of data using modern tools, languages, libraries and computing infrastructure. By hacking those technologies they can produce insights from data with statistical methods. If their background is physics, for instance, they will think of numerical methods or computer simulations to fit a modeled hypothesis to experimental data. Computer scientists who have explored areas such as information retrieval will proudly claim that machine learning finally has a better name. Then there are ex-scientists who are good at programming, and programmers who are good at statistics and scientific/numerical computing. All may be subjectively true, but if you look at reality, the variety of languages, tools, methods, and techniques for data analytics points to an art instead of a science. Yes, data analytics art, if you need a new name again (data artist is probably a better title?). But no, it is not attractive enough, and it was already taken by digital artists. Disclaimer: in the IT business, we are in high demand of new hype and jargon (read the Gartner Hype Cycle 2014). So let’s stick with data science, as it is the normally accepted name for this growing trend.

What do data scientists actually do? Does it cover collecting and pre-processing the data, formulating hypotheses, identifying algorithms/tools that fit, performing computation, communicating insights and creating abstractions for higher-level business people? Yes, perhaps all of those are written in their resumes, a mix of software/data engineering and data analysis tasks. As the field is still far from maturity, roles and responsibilities may change over time (I believe they will become business roles, not only IT ones), new data sources will explode along with other hypes (such as the Internet of Almost Stupid Things), and companies crafting automation tools/frameworks/platforms will emerge and raise more funds to innovate faster. More and more things can happen, as art has no end. The art of machine intelligence is still in progress. If we find a way to unsupervised machine intelligence, many other things can happen, including that we may not need data scientists and can let the machines work for us. We all need to respond (or just do nothing) to this new hyped trend. I choose to enjoy the show by hacking it!