I decided three months ago to explore a career in data science. My explorations have included Coursera courses, Medium articles, books, podcasts, and conversations with practitioners in the field at local Meetups.
From this immersion, I’ve gone from almost zero knowledge to knowing something. Whatever “something” is, I want to distill it down to the highlights and key points and create a record for future comparison. Hopefully, this public sharing will be of interest to those who enjoy learning new things. If you work in the field of big data/data science and see something that could use further explanation, please feel free to contribute with a reply.
Here is what I’ve learned.
What is “big data”? It’s the billions of Google searches per day. It’s the Netflix recommender system built on hundreds of millions of data points. It’s the data from every Fitbit. It’s all the likes, shares and the videos themselves on YouTube. It’s about extremely large amounts of data (volume), it’s about the speed at which new data is generated (velocity), and it’s about the range of data, such as photos and videos, that doesn’t fit nicely into your typical relational database (variety).
All of this big data can be extremely valuable to companies, and both data scientists and data analysts are employed to best take advantage of it.
Big data can provide insights to better serve customers (recommender systems), to evaluate the impact of proposed changes to a website (A/B testing), or to determine which of their customers are most likely to leave within the next six months (k-means analysis, machine learning). Advances in artificial intelligence, such as self-driving cars, are driven by effective machine learning algorithms that process terabytes and petabytes of information. But in order to take advantage of all this data, a company needs an infrastructure that can handle it.
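To make the churn example a little more concrete, here is a minimal sketch of grouping customers by activity with k-means in scikit-learn. The feature names and numbers are hypothetical, purely for illustration; a real churn model would use far more data and, often, a supervised approach.

```python
# Hypothetical sketch: grouping customers by activity with k-means (scikit-learn).
# Feature names and values are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is a customer: [logins_last_90_days, purchases_last_90_days, support_tickets]
customers = np.array([
    [45, 12, 0],
    [ 2,  0, 3],
    [30,  8, 1],
    [ 1,  1, 4],
    [50, 15, 0],
    [ 3,  0, 2],
])

# Scale the features so no single column dominates the distance calculation
scaled = StandardScaler().fit_transform(customers)

# Group customers into two clusters; the low-activity cluster is a churn-risk candidate
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)  # cluster assignment for each customer, e.g. [0 1 0 1 0 1]
```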
To process this firehose of data, especially for queries from data scientists and data analysts, one approach is to split the data into smaller chunks and have them processed by multiple servers (a cluster) simultaneously. This “clustered” approach can answer queries much faster than the traditional single-server approach; however, you need software (Hadoop, Apache Spark) that can manage the divide, compute and recombine process, including replicating the data across multiple servers so that if one of the processing servers goes down, the query can still be completed.
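Here is a small sketch of what that divide, compute and recombine process looks like from the analyst’s side, using Apache Spark’s Python API. The file path and column names are hypothetical; the point is that Spark splits the data into partitions, has each worker compute partial results, and recombines them for you.

```python
# Minimal sketch of divide/compute/recombine with Apache Spark (PySpark).
# The S3 path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cluster-demo").getOrCreate()

# Spark splits the files into partitions and spreads them across the cluster's workers
events = spark.read.parquet("s3://example-bucket/clickstream/")

# Each worker counts the rows in its own partitions; Spark recombines the partial counts
daily_counts = events.groupBy("event_date").agg(F.count("*").alias("events"))

daily_counts.show()
```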
So what does clustering have to do with “cloud computing”? Vendors such as Amazon’s AWS, Microsoft’s Azure and Google Cloud Platform let you do this clustering by leasing space on their servers. Instead of buying your own servers and keeping onsite staff to run them, you lease space on AWS and store all your data there (“in the cloud”). The advantages are that you can upgrade to faster speeds when needed, and the changes take place in less than a day. You can also increase the number of servers used during peak volume times and then dial it down during off-peak periods: most of the services let you configure your setup so that when 70 percent of server capacity is reached, additional servers are automatically added to handle the load.
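As a rough illustration of that “add servers at 70 percent capacity” idea, here is a sketch of a target-tracking scaling policy on AWS using the boto3 library. The group and policy names are hypothetical, and a real setup would need an existing Auto Scaling group and the right permissions; it is only meant to show how the threshold gets expressed.

```python
# Sketch: tell AWS to keep average CPU around 70% and add/remove servers automatically.
# The group name and policy name are hypothetical placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="example-web-servers",       # hypothetical Auto Scaling group
    PolicyName="scale-out-at-70-percent-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        # When average utilization climbs toward 70%, AWS adds instances; it removes them when load drops
        "TargetValue": 70.0,
    },
)
```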
Is it cheaper than owning your own servers? That’s a complex question, and the answer depends on a company’s current architecture and how long its peak-volume periods last.
The biggest advantages are speed, the flexibility to accommodate growth or significant fluctuations in demand, and uninterrupted service. Depending on how long the database can be offline without serious impact to the business, a company can have backups of its data quickly restored on cloud servers 500 miles away, or even halfway around the world.
Once a company has its hardware infrastructure in place, it needs to create an environment where queries, analyses and machine learning can be done without impacting the performance of the database used for day-to-day business operations.
If you’ve used products like Oracle or SAP, you are probably familiar with the concept of a “production database” for operations and a “test instance,” which is a mirror image of the database as of yesterday, for analysis.
With big data, you are creating a similar repository for analysis, often called a “data lake” when you are storing data from multiple systems. In order to pump data into the data lake, you need to build “data pipes”: scripts, created by data engineers, that pull the data out of the source systems (extract), convert it into the best format for the data scientists (transform), and upload it to the data lake (load).
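A toy version of such a data pipe might look like the sketch below, written with pandas and SQLAlchemy. The connection string, table, and data-lake path are hypothetical; the point is simply to show the extract, transform and load steps in order.

```python
# Toy sketch of an ETL "data pipe": extract from a source system, transform, load to a data lake.
# The connection string, table, and S3 path are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull yesterday's orders from the operational database (hypothetical source)
engine = create_engine("postgresql://user:password@prod-db.example.com/sales")
orders = pd.read_sql("SELECT * FROM orders WHERE order_date = CURRENT_DATE - 1", engine)

# Transform: reshape the raw rows into the format the data scientists want
orders["order_total"] = orders["quantity"] * orders["unit_price"]
daily_summary = orders.groupby("customer_id", as_index=False)["order_total"].sum()

# Load: write the result into the data lake as a Parquet file (writing to s3:// paths needs the s3fs package)
daily_summary.to_parquet("s3://example-data-lake/orders/daily_summary.parquet", index=False)
```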
These data pipes, called “ETL pipes” for the extract-transform-load process, are essential for data scientists to build machine learning models and for data analysts to do their analyses. If you read reports of the high turnover and short tenures (less than 18 months) of data scientists, this is often the root cause and the Achilles’ heel for companies. I’ve heard estimates that for every data scientist hired, you need two to five data engineers to support them: if the data scientists and data analysts can’t get access to the data, they’ve wasted their time joining a company that isn’t set up for success with big data. (The demand for data engineers right now, as you would imagine, is even higher than the demand for data scientists.)