Welcome, Ahmed! So to kick things off: what are some common misconceptions around data that you wish more people were aware of?
I think the biggest misconception about data is that to do data analysis you need clean data. Every person in the world says “data is messy, data is terrible, I can’t make decisions off of it.” But cleaning up data is a lifelong endeavor. You will ALWAYS be cleaning up data. It will never get clean! So I think the question should shift from “Is my data clean?” to “How do I structure my data so that even if it’s dirty I can still use it to make informed decisions?” I always joke that no company has one lead per person in Salesforce. It’s always four leads per person, five contacts; there’s a lot of duplication in the nature of it. So asking a simple question like “How many leads converted?” is a bit harder in data unless you structure it. The question shifts to “Given everyone’s first-ever lead, what is their first-ever conversion?” and that framing guarantees all the nuances and duplications get cleaned up inside the question we’re answering.
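The “first-ever lead, first-ever conversion” framing can be sketched in a few lines of Python. The field names and records below are hypothetical illustrations, not Narrator’s actual schema:

```python
from datetime import date

# Hypothetical lead records: one person can appear as many duplicate leads.
leads = [
    {"email": "a@x.com", "created": date(2021, 1, 5), "converted": None},
    {"email": "a@x.com", "created": date(2021, 2, 1), "converted": date(2021, 3, 1)},
    {"email": "b@x.com", "created": date(2021, 1, 9), "converted": None},
]

def first_lead_to_first_conversion(leads):
    """For each person, pair their first-ever lead with their first-ever
    conversion, so duplicate leads in between stop mattering."""
    by_person = {}
    for lead in leads:
        person = by_person.setdefault(
            lead["email"], {"first_lead": None, "first_conv": None}
        )
        # Keep the earliest lead date seen for this person.
        if person["first_lead"] is None or lead["created"] < person["first_lead"]:
            person["first_lead"] = lead["created"]
        # Keep the earliest conversion date, if any lead converted.
        if lead["converted"] is not None:
            if person["first_conv"] is None or lead["converted"] < person["first_conv"]:
                person["first_conv"] = lead["converted"]
    return by_person

result = first_lead_to_first_conversion(leads)
```

Structured this way, “How many leads converted?” becomes one row per person, and the duplicates never enter the count.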
With your own personal journey as a data science professional, did that inspire you to create a tool that you wish existed through Narrator.AI?
Oh, yes. So I came from a background of AI, and everyone in the world of AI uses time-series tables for data modeling. And that problem doesn’t exist when you’re just doing algorithms. In the startup community, every question you have to answer is very difficult, because of the vastness of the sources and how unusable the various data structures are. To give you an example: the data structures we have are designed for machines. You have a ticket object, the ticket has a status on it, and that ticket has a bunch of comments on it. It’s designed to be about the ticket. But when people ask questions, they don’t ask about tickets. They ask “How many people visited the FAQ page, then submitted a ticket right after?” That question takes, like, HOURS, if not days, to do by hand. And it seems simple! Because you’re like, “Duh, I can see I did it!” But computers are independent systems with independent data with independent structures, so that simple link is actually very hard to bridge. That is what inspired me to build Narrator.ai. I was sitting there, doing that work, and knowing there’s a world where, when you structure information better, you can answer questions faster. Instead of spending a day on it, you’re spending a minute. That’s the first biggest component: I felt there was a lot of wasted time analysts spend dealing with data-structure bullshit. Every two years, people think of initiatives to redesign the system; it looks good, and then as time goes by, it quickly fails again. We are all in that world, reinventing the wheel. So the whole structure of Narrator, a standardized, globalized schema that we know will work whether you’re a $100 billion company or a two-person startup, sets you up for success. If I had had this product when I was working as an analyst and data scientist, I would have been 100x more efficient.
Can you walk me through how Narrator.AI could help solve a data need for a business, enterprise or organization? What would a sample usage case look like?
Great question. So we just had a question a couple of days ago, where someone was trying to understand their lead source, whether someone comes from social, search, or paid: are they more likely to convert from lead to sale, later on in the funnel? So somebody comes into the website from search; two months later, are they more likely to convert to a sale?
So you’re bridging a couple of sources here. Because of our structure, our core technology enables us to deliver very rich analyses that we can share among all our companies. With Narrator.ai, we can share analyses outside of a data modeling solution. And because we can share analyses, we can spend a lot of time designing them to be very thorough and detailed, because we’re solving it ONCE for everyone.
So when we deliver that analysis for that person, in the case we’re talking about, we do a LOT of stuff. We clean up and figure out consistent times, to make sure time variance and those various behaviors don’t distort the data. If you’re trying to figure out whether something converts better or worse, and some months you had a really good conversion rate while others you had a bad one, we have to remove those errors and variances from the data so we can decide whether the difference actually matters.
Then we have to check whether it’s statistically significant that search is better than social. And in this case, search did convert significantly better than social. An average analyst would stop there. But Narrator.ai is able to go deeper.
So now not only is it statistically significant, but has it ALWAYS been better? Is it possible at one point it spiked, or over time has search always been significantly better than social?
So then in this customer’s case, it was: 93% of the time, search was better than social. A senior analyst would stop there. But we go even further.
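The significance check described here is, at its simplest, a two-proportion z-test. A minimal sketch in Python, with made-up conversion counts standing in for the client’s real numbers:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z-statistic for H0: channel A and channel B convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both channels are equal.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical numbers: search converts 220 of 2000, social 150 of 2000.
z = two_proportion_z(220, 2000, 150, 2000)
significant = abs(z) > 1.96  # two-sided test at the 5% level
```

With these invented counts the gap clears the 5% bar easily, which is exactly the point where, as he says, an average analyst would stop.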
So now when you get to this level, it could be a case of what we call “correlation vs. causation”: is it something about search and social, or is something else happening to drive this impact? So we dive even deeper, because we want to give you an actionable recommendation.
So the actionable recommendation, if we stopped there, would be “Take people from social and move them to search. Spend more on search.” But we also check: in the past, when the mix of people coming from social or search shifted, was there a causal effect moving the conversion rate in that direction as well?
And it turns out there WASN’T. In all of this company’s history, every time people were shifted toward social, it made no impact on the overall conversion rate. So this whole thing was a correlation, and the action was “don’t do anything.” The channel doesn’t affect the conversion rate at all. It happened to overlay it, but it wasn’t a driver, so changing it didn’t matter.
In the case of this client, we recommended that they NOT shift their distribution. Something else is driving the impact; don’t start shifting your money, because it will have no impact. Instead, we’re developing a new hypothesis. And the fact that we can iterate through hypotheses and analyses so quickly matters: that would have taken a senior person a LONG time to run. And our system is smart enough to generate it, with the reasoning.
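One very rough way to approximate that historical check, and this is an illustrative stand-in, not Narrator’s actual method, which the interview doesn’t detail, is to ask whether past shifts in the channel mix correlated with movement in the overall conversion rate. A toy sketch with invented monthly figures:

```python
# Hypothetical monthly history: share of traffic coming from search,
# and the overall conversion rate in that month.
search_share = [0.30, 0.45, 0.25, 0.50, 0.40, 0.35]
conversion   = [0.041, 0.041, 0.039, 0.039, 0.040, 0.040]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(search_share, conversion)
# A correlation near zero suggests shifting spend toward search
# would not move the overall conversion rate.
```

In this made-up history the mix swings widely while conversion barely moves, which is the pattern behind the “don’t do anything” recommendation. (A real causal analysis would control for confounders, not just eyeball a correlation.)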
If someone was inspired by your career and wanted to start working in data science or data engineering, what would you recommend they look into as far as studies, internships, or concrete next steps?
Great question. So: I will say there is a set of basics you should know for data science, and some basics for analysis. For data science, I think you have to start learning Bayesian mathematics. Bayesian mathematics, at its core, is just the idea that given measurements and information, we can improve our hypotheses. It says that every time I observe something, I can add information to help me make a better decision; it’s a mathematical way to do that. The other major thing about Bayesian mathematics is that it’s heavier on time series: in Bayesian mathematics, time is often a control, not a feature. What that means is we assume everything is happening in time. This is SUPER-different from a lot of other practices, where time is a feature, where you look at tenure, or how long someone has been a member, instead of realizing that they are changing over time. And in a startup, and in today’s age, everything is changing at such a rapid pace that if you don’t take time into account, you are often blinded by the results.
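The “every observation improves the hypothesis” idea has a classic minimal form: a Beta-Binomial update on a conversion rate. A sketch, with invented observations:

```python
# Start with a Beta(1, 1) prior, i.e. no opinion about the conversion rate.
# Each observed success bumps alpha; each failure bumps beta.
alpha, beta = 1.0, 1.0

observations = [1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical: 1 = converted
for outcome in observations:
    if outcome:
        alpha += 1
    else:
        beta += 1

# Posterior mean: the best estimate after folding in all observations.
posterior_mean = alpha / (alpha + beta)
```

Every new data point refines the estimate rather than replacing it, which is the mathematical version of “each measurement adds information to help me make a better decision.”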
The second area is analysis. Analysis is learning about variance: how do you reason about uncertainty in the data? When we talk about the analysis we delivered, it’s very easy to look at the aggregate and see a very big impact. As you dive deeper and deeper, remove variability and time, remove inconsistency, and look for causality, the picture often changes. Most analysts are trained to look at the top level and ignore the rest. And that’s really dangerous.
The final thing I would say is: if you cannot explain your analysis, and how you got to the conclusion you made, to a fifth grader, then it’s not worth it. So often you see AI tools that say “men convert better than women,” and you’re like, “Cool, how did you get that? Why? Am I going to change my whole business? How did you come up with that conclusion?” And they say, “Trust us, it’s AI.” That’s BAD. You can’t work with that. You can’t use it! It’s never right. To give you a really good example of the dichotomy between our approach and the black-box approaches: I was working with a company, and we were trying out a big AI product from one of the Fortune 500 companies. The vendor said, “Our AI solves your churn problem!” And we’re like, “What do you mean, solves our churn problem? How did you figure that out? It’s a complicated problem.” We go into a meeting, and they show us the results of what the AI decided. The answer it came up with was “when someone presses the cancellation button, that’s the biggest indicator they’re leaving.” This is a real story. The computer can’t reason: that’s the dilemma with black-box approaches. So learning to explain and analyze is key, and only using approaches that can be explained.
What are some data challenges you find particularly fascinating?
I think one of the most difficult and interesting data challenges is translating a question into a data question. Structuring questions to BE data questions is an art, and it sets the foundation for a reliable analysis (or an unreliable one). For example, if you ask me: “Does featuring a topic affect the likelihood that someone will search, find, and engage with it?” that becomes “Given adding a Featured tag, are people who search for a specific topic and see it more likely to engage with it?” Notice how that question already has so many different dimensions to break down, to identify the behavior you’re impacting versus not impacting.
A bad translation of that would be “Are featured topics getting more views?” That’s a very bad data question, because when you answer it, you have no idea what to do with the answer. There are a million things that could drive it, and you have no idea which. It doesn’t lead you to an actionable result of “Should you feature something or not?”
If you want to take an action based on the question you asked, the high-level question, “Do posts that are featured get more views?”, doesn’t tell you whether you should tag something. The first question helps you understand the impact of tagging something as featured. So that’s a big thing I would tell people to do: translate your questions into data questions accurately. That fascinates me.
The breakdown, in practice, is to ignore the exact words the person says and identify what action they’re trying to take. When you asked whether featuring affects something, what you’re really asking is “Should I tag my posts as featured?” That has to be broken down to “If I tag my posts, are they more likely to get viewed than if they are not tagged?” It becomes a nuanced data question, so I can tell you whether or not to tag your posts. Often, data analysts outsource that work to YOU, so that you, without seeing any information, guess whether it would work or not. You end up trying things, guessing, and working blindly. And that’s not good! This is what we tackle next: at Narrator.ai, if we can make your analyses super-quick, then you can spend all your time talking to your customer to understand their question. And eventually, we hope to tackle this human-translation problem too.
Yes. When I was in college, I worked in a robotics lab that was responsible for an autonomous car, and I remember my professor teaching me the basics of AI. I was designing robots at the time, and there was a specific moment, I remember, where he said, “Pick up a cup,” and asked me to explain to him programmatically how you pick up a cup. And that’s a REALLY hard problem! To pick up a cup, a robot or a human being uses stereo vision to measure how far away the cup is. Then we take all the muscles in our body and solve a very complicated optimization problem to figure out how to use the least energy to get close to the cup. As we move our hand toward the cup, our eyes recalibrate and give us feedback until we hold it. Once we grab the cup, our fingers detect the force on it, until we have enough force to pick it up. And then there’s a whole lot of computer vision, too, to figure out “What is the cup?” and “What shape is the cup?”
And everything here has error! So that moment is when I began realizing that our brains can be modeled in math. And there’s so much beauty in the complexity of that model. And I was hooked ever since. So everything we do, if you break it down to its components, it’s us humans solving all these math problems in our heads magically, reasoning about all this uncertainty, to act!
Wouldn’t it be awesome if we could understand that better?
The beauty of wanting to understand that better is what inspired me to dive into robotics, so I studied robotics!
Great question. So we touched upon it in an earlier conversation. I’ll break it down to 3 common problems.
One is assuming the data is clean.
I mentioned that all data is dirty. You have to figure out how to ask that question with the data being dirty.
Two is assuming the question itself is correct.
Which is every question is trying to get at an action. And often, the question itself has to be rephrased multiple times until you can find the right question to guide an action.
And the third problem is time.
Because doing an analysis right is super-time-consuming, due to modeling issues, structuring issues, query issues, and math issues, we get excited and aren’t able to dive as deep. People do PhDs on these analyses; they spend months trying to isolate and understand. In an environment where you have days or weeks to come up with an action, you end up making mistakes, unless you have support or tooling to accelerate the process.
And I’ll add one more: the last thing is the illusion of self-service. An untrained person can’t look at a dashboard and figure out what to do. I had to go through robotics training to learn how to understand data and make decisions from it. Yet everyone thinks they can make a decision by looking at a dashboard. I don’t believe there’s ever been a decision in history that someone found on their own just by looking at a dashboard. You have lucky situations, but the dashboard doesn’t necessarily say whether something is good or bad. So let experts do the analyses, and you consume them. Everyone wants to do self-serve analytics, but they may not be qualified for it.
Haha! So I happen to know a little bit about this. At one point, and I think it’s still the case, adding a show to your list makes NO difference. I’ll tell you why: what Netflix figured out was that when people add shows to their list, they’re being overly optimistic about what they’re going to watch. People add documentaries and the like, stuff that, if we used that data in the recommendations, would make you LESS likely to like the recommendations. So Netflix, similar to Spotify, uses your behavior. At its core, based on what you watch, in what order, how often you start something and don’t finish it, the whole of your behavior, it groups you with other people like you who’ve watched nearly the same movies, and looks at what those people have binged, or what has made Netflix more successful. There’s also a cost component to the recommendations: Netflix makes money off of every view, so it’s more likely to recommend a Netflix original over a licensed show. It has to make a trade-off: what is the best show you are most likely to finish that will make Netflix the most money, based on your behavior and the behavior of people with similar movie taste?
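The “group you with people who watched nearly the same shows” idea is the core of collaborative filtering. A deliberately tiny sketch, nothing like Netflix’s real system, with hypothetical users and show names:

```python
# Hypothetical watch histories: 1 = actually finished the show,
# 0 = added to a list or started but never finished.
watched = {
    "you":   {"drama_1": 1, "drama_2": 1, "doc_1": 0},
    "alice": {"drama_1": 1, "drama_2": 1, "doc_1": 0, "thriller_1": 1},
    "bob":   {"drama_1": 0, "drama_2": 0, "doc_1": 1, "nature_1": 1},
}

def similarity(a, b):
    """Count of shows both users actually finished: a crude stand-in
    for the behavioral similarity a real recommender would compute."""
    return sum(1 for show, seen in a.items() if seen and b.get(show))

def recommend(user, histories):
    """Recommend shows finished by the most behaviorally similar user
    that this user hasn't finished yet."""
    others = {name: h for name, h in histories.items() if name != user}
    best = max(others, key=lambda name: similarity(histories[user], others[name]))
    mine = histories[user]
    return [show for show, seen in others[best].items()
            if seen and not mine.get(show)]

recs = recommend("you", watched)
```

Note the sketch scores only finished shows, echoing the interview’s point: stated intent (the list, the unfinished documentary) is ignored in favor of what people actually did.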
There’s always a cost-price component when you deliver a marketplace. As you get more and more into the world of algorithms and data, there's a lot of fascination for me in how you make a decision.
If you give people what they SAY they want, they’ll be really mad at you. So you have to give them what they ACTUALLY want, that they didn’t communicate.
It’s the same in data: everyone says “Oh, promos will help! Best prices will get sales!” And that’s true only sometimes. It depends on your business and the customer behavior you want to change. The idea of a best practice, or an average, means sometimes it works and sometimes it doesn’t. So you should figure out what drives results for your business first, and then act on that. For Narrator.ai, that’s our bread and butter; there’s a reason every analysis runs on your own data. If you have a call center and somebody calls you, should you pick up the phone or send them to voicemail? It seems obvious, but more often than not, the answer is surprising. It might be that calling increases conversion, but answering the call doesn’t matter, because the act of calling is what shows interest. Whether you pick up and sign them up, or they go online and sign themselves up, might not affect it at all. There’s a lot of nuance. Your intuition is good, but we make decisions based on actual past behavior.
What's the best way we can stay updated with you?
I’m not on social media, but if you submit your email on Narrator.ai, we’ll send you our biweekly newsletter updating you on what the company is doing and where we’re headed. That’s the best way to stay in the loop. Or email me if you’re excited! My email is email@example.com, and I have no problem talking to anyone who’s excited about these topics. Eventually, as we grow, I want to create a community where analysts can share their work, so that data analysts can be like engineers with open source. That’s where I hope to bring things in the future.