Cake
    • I decided three months ago to explore a career in data science.  My explorations have included Coursera courses, Medium articles, books, podcasts, and conversations with practitioners at local Meetups.

      From this immersion, I’ve gone from almost zero knowledge to knowing something. Whatever “something” is, I want to distill it down to the highlights and key points and create a record for future comparison.  Hopefully, this public sharing will be of interest to those who enjoy learning new things. If you work in the field of big data/data science and see something that could use further explanation, please feel free to contribute with a reply.

      Here is what I’ve learned.

      What is “big data”?  It’s the millions of Google searches per day.  It’s the Netflix recommender system based on hundreds of millions of data points.  It’s the data from every Fitbit. It’s all the likes, shares and the videos themselves on YouTube.  It’s about extremely large amounts of data (volume), it’s about the speed at which new data is generated (velocity), and it’s about the range of data, such as photos and videos, that don’t fit nicely into your typical relational database (variety).

      All of this big data can be extremely valuable to companies, and both data scientists and data analysts are employed to best take advantage of it.

      Big data can provide insights to better serve customers (recommender systems), to evaluate the impact of proposed changes to a website (A/B testing), or to predict which customers are most likely to leave within the next six months (churn modeling with machine learning).  Advances in artificial intelligence, such as self-driving cars, are driven by effective machine learning algorithms that process terabytes and petabytes of information. But in order to take advantage of all this data, a company needs an infrastructure that can handle it.
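As a concrete illustration of the A/B-testing use case above, here is a minimal sketch (standard library only, with made-up numbers) of a two-proportion z-test, one common way to check whether a change to a website actually moved the conversion rate:

```python
# Hedged sketch: two-proportion z-test for an A/B test.
# All visitor counts and conversion numbers below are invented.
from math import sqrt, erf

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Return the z-score and two-sided p-value for the difference
    between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via the error function; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variant B converts 12% vs. A's 10%, over 5,000 visitors each.
z, p = ab_test_z(500, 5000, 600, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value below the conventional 0.05 threshold would suggest variant B's lift is unlikely to be just noise.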

      To process this firehose of data, especially for queries from data scientists and data analysts, one approach is to split the data into smaller chunks and have them processed by multiple servers (a cluster) simultaneously.  This “clustered” approach can answer queries much faster than the traditional single-server approach; however, you need software (Hadoop, Apache Spark) that can manage the divide, compute and recombine process, including replicating the data across multiple servers so that if one of the processing servers goes down, the query can still be completed.
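The divide-compute-recombine pattern described above can be sketched in miniature: a toy MapReduce-style word count in which worker threads stand in for the machines of a real cluster (all data here is illustrative):

```python
# Hedged sketch of divide-compute-recombine (MapReduce style).
# Real clusters run the "map" step on separate machines; here,
# threads stand in for cluster nodes.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def map_count(chunk):
    """'Map' step: count words within one chunk of the data."""
    return Counter(chunk.split())

def reduce_counts(partials):
    """'Reduce' step: recombine the partial results."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

data = ["big data is big", "data moves fast", "big clusters process data"]

with ThreadPoolExecutor(max_workers=3) as pool:    # the "cluster"
    partials = list(pool.map(map_count, data))     # divide + compute
word_totals = reduce_counts(partials)              # recombine

print(word_totals["data"])  # → 3
```

Hadoop and Spark add the hard parts this toy omits: distributing the chunks across machines, replicating them, and rerouting work when a node fails.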

      So what does clustering have to do with “cloud computing”?  Vendor products such as Amazon’s AWS, Microsoft’s Azure and Google Cloud Platform allow you to do clustering by leasing space on their servers.   Instead of having to buy your own servers and keep onsite staff to maintain them, you lease space on AWS and store all your data there (“in the cloud”).  The advantages are that you can upgrade to faster speeds when needed, and the changes take place in less than a day. You can also increase the number of servers used during peak volume times and then dial it down during off-peak periods: most of the services allow you to configure autoscaling so that when 70 percent of server capacity is reached, additional servers are automatically added to handle the load.
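The 70-percent rule above amounts to a simple feedback loop. Here is a toy simulation of the idea (thresholds and numbers are invented for illustration, not any vendor's actual API):

```python
# Hedged sketch of a utilization-based autoscaling rule.
# Each server's capacity is normalized to 1.0 units of load.

SCALE_UP_AT = 0.70    # add a server when utilization exceeds 70%
SCALE_DOWN_AT = 0.30  # shed a server when it falls below 30%

def autoscale(servers, load):
    """Return the new server count for the given total load."""
    utilization = load / servers
    if utilization > SCALE_UP_AT:
        servers += 1
    elif utilization < SCALE_DOWN_AT and servers > 1:
        servers -= 1
    return servers

servers = 4
for load in [2.0, 3.0, 3.5, 3.6, 1.2, 1.0]:   # peak, then off-peak
    servers = autoscale(servers, load)
    print(f"load={load}, servers={servers}")
```

The fleet grows through the peak and shrinks back afterwards, which is exactly the pay-for-what-you-use appeal of the cloud.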

      Is it cheaper than owning your own servers?  That’s a complex question that is going to have different answers depending on the current architecture and the duration of high peak volumes.

      The biggest advantages are speed, flexibility to accommodate growing or significantly fluctuating demand, and uninterrupted service.  Depending on how long the database can be offline without serious impact to the business, a company can have backups of its data quickly restored on cloud servers that are 500 miles away, or even halfway around the world.

      Once a company has its hardware infrastructure in place, it needs to create an environment where queries, analyses and machine learning can be done without impacting the performance of the database that runs day-to-day business operations.

      If you’ve used products like Oracle or SAP, you are probably familiar with the concept of a “production database” for operations and a “test instance,” which is a mirror image of the database as of yesterday, for analysis.

      With big data, you are creating a similar repository for analysis, often called a “data lake” when you are storing data from multiple systems.  In order to pump data into the data lake, you need to build “data pipes”. Data pipes are scripts created by Data Engineers that pull the data (extract), convert the data into the best format for the data scientists (transform), and upload it to the data lake (load).
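The extract-transform-load steps described above can be sketched as three small functions. All field names and values here are invented for illustration; a real pipe would read from a production system and write to actual data-lake storage:

```python
# Hedged sketch of an ETL "data pipe": extract raw records,
# transform them into an analysis-friendly shape, and load them
# into a (here, in-memory) data lake. All data is made up.

def extract():
    """Pull raw rows from a source system (hard-coded here)."""
    return [
        {"user": "a1", "signup": "2018-03-05", "spend_cents": "1250"},
        {"user": "b2", "signup": "2018-04-11", "spend_cents": "990"},
    ]

def transform(rows):
    """Convert rows into the format analysts want (typed, in dollars)."""
    return [
        {"user": r["user"],
         "signup_year": int(r["signup"][:4]),
         "spend_dollars": int(r["spend_cents"]) / 100}
        for r in rows
    ]

def load(rows, lake):
    """Append the cleaned rows to the data lake."""
    lake.extend(rows)

data_lake = []
load(transform(extract()), data_lake)   # one full ETL pass
print(data_lake[0])
```

Data engineers spend much of their time on the unglamorous middle step: reconciling formats, units and identifiers across the many systems feeding the lake.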

      These data pipes, called “ETL pipes” for the extract-transform-load process, are essential for data scientists to build models for machine learning or for data analysts to do analyses.  If you read reports of the high turnover and short tenures (less than 18 months) of data scientists, this is often the root cause and Achilles heel for companies. I’ve heard estimates that for every data scientist hired, you need two to five data engineers to support them: if the data scientists and data analysts can’t get access to the data, they’ve wasted their time joining a company that isn’t set up for success with big data.  (The demand for data engineers right now, as you would imagine, is even higher than for data scientists.)

      **************

      Related discussions

    • This is big data 101, and it's certainly an interesting topic, but what do you hope to accomplish by making this a panel? There doesn't seem to be a question posed.

    • Thanks for taking the time to reply and for the constructive feedback, Chris. I think there are several questions that I would love to discuss here in a panel format.  

      Data Science

      How prevalent is data science in your local ecosystem?  (You may have seen the report that came out last month that 80% of data scientists work at Facebook, Amazon, Netflix or Google.)

      Artificial Intelligence and Machine Learning

      Is the “iterate till it works” process the norm in AI?  (It certainly feels that way when you read the news of accidents with Tesla’s assisted driving mode.)

      Marketing

      What’s your experience in using data science/data analysis in marketing?  (A/B testing, Natural Language Processing, recommender systems)

      Data Engineering/Data Visualization

      What are the new tools worth considering over the established ones? (PyTorch versus TensorFlow, Google Data Studio versus Tableau, Scala versus Python.)

      Hope the above clarifies things.

    • Big Data

      Thank you for the invitation, it’s been a long time since I was last here. Coincidentally, I’m currently working on an AI/ML-related project and have been trying to learn more about the subject of Data Science as well. Here are some contributions:

      1) “How prevalent is data science in your local ecosystem?” - Our project is in education, trying to improve all aspects of the E-learning process through ML. Researching the current landscape doesn’t turn up many results. Coursera seems to be more active, with some efforts on its website (recommendations, search results) and in mail marketing, but that’s about where things end from what I have seen.

      2) “Is the “iterate till it works” process the norm in AI?” - I’m no AI expert, but I think an iteration-validation cycle is a basic part of any software development, with or without AI/ML. Unless you meant something different here. ML definitely includes validation of the model’s output as a crucial step.

      3) “What’s your experience in using data science/data analysis in marketing?” - I haven’t had the chance to do anything on the marketing side. As noted above, Coursera seems to be making some effort, periodically sending out updates about new courses based on previous participation in a course.

      4) “What are the new tools worth considering over the established ones?” - As always with software development, the choice of tools depends on the project requirements. I don’t think one should choose sides. What I find interesting is that we have 4 major companies (Amazon, Microsoft, Google, IBM) with quite developed portfolios of AI/ML/DS cloud-based tools. In large part these can be swapped around or combined. It’s also very positive to see these companies trying to fill the education gap: IBM and Amazon in particular offer free/low-cost online courses that can help more people get into DS and AI.

      I hope these comments help in any way :)

      Thank you for reading!

    • “Thank you for the invitation, it’s been a long time since I last was here.”

      I suppose that “Welcome back, MarkG!” then is an appropriate response.

      “Our project is in education, trying to improve all aspects of the E-learning process through ML. Doing a research of the current landscape doesn’t show many results. Coursera seems to be more active with some efforts in the website (recommendations, search results) and mail marketing, but that’s about where things end from what I have seen.”

      You may find this data science podcast on creating adaptive tests useful.  Instead of testing every student on the same 50 items regardless of ability, the computer-based test adjusts the difficulty of each subsequent question up or down after every response until it has determined your competency.  There are some real opportunities there for natural language processing (NLP) and other machine learning approaches. Hope you find it interesting.
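A toy version of that adjust-up, adjust-down loop, just to make the mechanism concrete (all numbers are invented; real adaptive tests estimate ability with item response theory rather than a simple difficulty ladder):

```python
# Hedged sketch of an adaptive test: difficulty rises after a correct
# answer and falls after a wrong one, instead of every student
# answering the same fixed set of items.

def adaptive_test(answers_correct, start=5, lo=1, hi=10):
    """Walk difficulty up/down per response; return the final
    difficulty level as a crude competency estimate."""
    level = start
    for correct in answers_correct:
        if correct:
            level = min(hi, level + 1)   # harder question next
        else:
            level = max(lo, level - 1)   # easier question next
    return level

# A student who answers the first four correctly, then misses two,
# settles near level 7 without sitting through all 50 items.
print(adaptive_test([True, True, True, True, False, False]))  # → 7
```

The payoff is that the test converges on each student's level in far fewer questions than a fixed-length exam.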

      Many thanks 🙏 for the link! Adaptive testing is definitely a valuable concept, not only for saving time (as mentioned in the podcast), but also for providing variation and reducing cheating (while still being fair towards students evaluated through different tests). In our project we are also working (using ML) on adaptive learning paths and on providing additional reference material as a means of accelerating the learning process and improving completion rates.

    • “I haven’t had the chance to do something in terms of marketing use. As noted above, Coursera seems to be making some effort, periodically sending out updates about new courses based on previous participation in a course.”

      For learning Big Data and Marketing, I’ve found that it requires a lot more digging to find useful case studies and education. 

      A recent interview I read with a data scientist at Quora on their A/B testing approach was fascinating: they have hundreds of millions of ML-generated views, so each user receives a feed of content that is optimized to their behavior-exhibited preferences.

      What are “behavior-exhibited preferences?”

      Quite simply, it’s recommendations based on what you do rather than what you say you want.  For example, the most interesting thing I learned about Netflix’s recommender system is that it makes suggestions based on what you actually watch, not what you add to your queue: all those intellectually curious documentaries that have been sitting unwatched in your queue for six months won’t fool Netflix that what you really want is more comedy flicks.


    • Quite simply, it’s recommendations based on what you do rather than what you say you want.  For example, the most interesting thing I learned about Netflix’s recommender system is that it makes suggestions based on what you actually watch, not what you add to your queue: all those intellectually curious documentaries that have been sitting unwatched in your queue for six months won’t fool Netflix that what you really want is more comedy flicks.

      This is also described as explicit vs. implicit ratings. Explicit ratings mean marking something with a 1-5 rating, clicking a like button, etc. Implicit ratings are deduced from monitoring people’s behavior. The illustration below is from a quite helpful book on the subject: "Practical Recommender Systems".
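The distinction can be shown in a few lines. This is a hedged sketch of an implicit signal inferred from watch behavior, in the spirit of the Netflix example above (all events and genres are invented):

```python
# Hedged sketch: implicit "ratings" derived from behavior.
# Minutes actually watched count; items merely queued do not.

def implicit_genre_scores(events):
    """Score genres by minutes watched, ignoring queue additions."""
    scores = {}
    for e in events:
        if e["action"] == "watched":          # behavior counts...
            scores[e["genre"]] = scores.get(e["genre"], 0) + e["minutes"]
        # ...stated intent (queueing) is ignored
    return scores

events = [
    {"action": "queued",  "genre": "documentary", "minutes": 0},
    {"action": "queued",  "genre": "documentary", "minutes": 0},
    {"action": "watched", "genre": "comedy",      "minutes": 95},
    {"action": "watched", "genre": "comedy",      "minutes": 110},
    {"action": "watched", "genre": "documentary", "minutes": 12},
]

scores = implicit_genre_scores(events)
print(max(scores, key=scores.get))  # → comedy
```

The queued-but-unwatched documentaries contribute nothing, so the recommender would keep serving comedies, exactly the behavior the Netflix anecdote describes.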

      Hi Stephen, sorry for the slow response. I was out of the country for most of the last 3 weeks and largely offline during that time. Interesting topic to kick around.

      It seems that monitoring our behaviors and extracting predictable patterns, which in turn can be converted into actionable insights that will drive future (buying) decisions, is the holy grail of using big data for marketing purposes. The discussion on explicit and implicit ratings touches on this. But where I think AI/ML/Big Data will largely fall a bit short is at the individual level. People are fickle. Statistically, what I did yesterday may have a strong influence on what I do today, but no AI or Big Data process will be able to monitor all of my behaviors and determine that I might wake up tomorrow and decide "today is the day that I am going to start doing X" or "ok... I'm kind of tired of binge-watching Netflix" or whatever. Maybe recommendation engines work well for many people, but I have yet to find one that works for me. That I like one musical artist does not mean I like some other musical artist that the recommendation engine thinks is closely related, because for me it may not be about "genre" but about something else hidden deeper in the music. But I may just be an oddball.

      Where I see Big Data being really powerful is in sorting out problems that are ridiculously multivariate. Better understanding how individuals in different populations will respond to different medical treatments, for example. Climate. Or industrial applications where 2nd- or 3rd-order effects on individual variables interact to produce significant, but rare, occurrences which can influence production efficiency, product quality, reliability, and safety. Marketing can fit into this category as well, across large demographics, predicting future trends based on current events, etc. But I think it may be a fool's errand to try to predict the behavior of Jane Smith as if she is nothing other than some algorithm that can be decoded given enough data, the right data scientists, and sufficient server farms.

    • our project is in education, trying to improve all aspects of the E-learning process through ML. Doing a research of the current landscape doesn’t show many results. Coursera seems to be more active with some efforts in the website (recommendations, search results) and mail marketing, but that’s about where things end from what I have seen.

      I thought the below graph was interesting. Manufacturing was generating 4X as much data as education ten years ago. Curious as to whether the ratio is still the same.


      Although my understanding of data science is very limited, I hold a deep curiosity about how the use of data influences the human connection. My biggest concern is how data is used by governments and businesses to influence the will of the users. Although there are countless success stories on the use of data to make our lives better, there is a fundamental concern I see with the lack of user data rights and the manipulation of data through media outlets. Even though we have an incredible toolset before us, I still believe we are cavemen trying to taste a burning branch struck by lightning. It looks awesome, but do we really know what it is capable of?

      This notion of using data as a sword to disrupt industry is beginning to reveal itself as merely a means to monetize the interests of users and build other industries on top of them. The hands wielding the sword are grateful for such a favorable environment to sow their influence over users hungry for the carrot of convenience. So what does all of this have to do with your post? My contribution to this panel is to ask: if you are to be a steward of the technology, consider how the tech will impact the end user, and how your contribution can lead not only to the application of it but also to understanding the implications of it. Ultimately these technologies are not being built by huge corporations; they are being built by individuals, who could eventually move the cheese to use cases favorable to the interests of the users.

      Thank you for inviting me to this panel Stephen. I have enjoyed reading the thread!!

    • But where I think AI/ML/Big Data will largely fall a bit short is at the individual level. People are fickle. Statistically, what I did yesterday may have a strong influence on what I did today, but no AI or Big Data process will be able to monitor all of my behaviors and determine that I might wake up tomorrow and decide "today is the day that I am going to start doing X" or "ok... I'm kind of tired of binge watching netflix" or whatever.

      Have you seen this ⬇️? It’s more than five years old, and I suspect the algorithms have improved significantly since then.

      They may not be able to predict when you’ll buy something, but there seems to be more and more data available to predict what you will buy.