Disappointed expectations aren’t the only danger of hype, though that danger is real enough. (Google Glass much?) Possibly more damaging are all the wrong questions that hype can lure individuals and organizations into asking. Artificial intelligence (AI), especially deep learning, is sexy, exciting, filled with promise … and perhaps the last thing many organizations are ready for, despite their excitement. For many organizations, the right question isn’t “How do we incorporate AI into our business processes?” The right question is “Can we count?”

Crawl Before You Run, Count Before You Analyze

More specifically, the fundamental question for any organization starting out with analytics is “Can we count accurately?” This question comes from Monica Rogati’s excellent article on Hackernoon, “The AI Hierarchy of Needs.” In the article, Rogati points out that the foundation of analytics is counting: log events, user clicks, sensor readings, whatever.

Figure 1. The data science “Hierarchy of Needs”[1]

No data science can happen without counting, and a lot has to happen between mastering counting and mastering AI. Working up the Maslowian pyramid, these needs are:

  1. Can we, as an organization, count accurately? Can we count what we need? Are we gathering all of the relevant data we need (or want)?
  2. Once we can count, does our data flow reliably? Can we easily access the data that we have collected? Can we get fresh data as new questions arise?
  3. If our data flows, can we explore it? Can we cleanse it? Can we detect missing or erroneous data? Can we transform our data to meet our needs based on business logic?
  4. Can we perform analytics on our prepared data? Can we identify useful metrics and cyclic patterns in our data? More importantly (at least for AI and machine learning [ML]), can we identify and capture training data for our models?
  5. Can we, as an organization, build frameworks to incrementally build and improve ML models—and to track down bugs in our models before the models grow too complex?
  6. And finally, how do we incorporate AI into our business processes?
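As a rough illustration of the first and third needs, counting accurately and detecting missing or erroneous data, here is a minimal Python sketch. The event records, their field names, and the `count_events` helper are all hypothetical; a real event log would have its own schema and its own rules for what counts as a duplicate.

```python
from collections import Counter

# Hypothetical raw event log; the field names are illustrative assumptions.
events = [
    {"id": 1, "type": "click", "user": "a"},
    {"id": 2, "type": "click", "user": "b"},
    {"id": 2, "type": "click", "user": "b"},   # same event delivered twice
    {"id": 3, "type": "view", "user": None},   # missing user
    {"id": 4, "type": "view", "user": "a"},
]

def count_events(raw):
    """Return (counts by type, duplicate ids, ids with missing fields)."""
    seen = set()
    duplicates = []
    incomplete = []
    counts = Counter()
    for event in raw:
        if event["id"] in seen:
            duplicates.append(event["id"])
            continue  # do not count the same event twice
        seen.add(event["id"])
        if any(value is None for value in event.values()):
            incomplete.append(event["id"])
        counts[event["type"]] += 1
    return counts, duplicates, incomplete

counts, duplicates, incomplete = count_events(events)
print(dict(counts), duplicates, incomplete)
```

Even in this toy case, a naive tally would report three clicks instead of two, which is the kind of quiet error that undermines everything built higher up the pyramid.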

The common theme of all of the needs in the hierarchy is building out rational, maintainable infrastructure and processes to process and analyze data.

You Might Not Be Interested in Plumbing, but Plumbing Is Interested in You

As an introduction to his Medium article, “A Beginner’s Guide to Data Engineering—Part 1,” Robert Chang reflected on this kind of disillusionment at his first job. As a newly minted data scientist, Chang expected to generate terrific insights for his company from their data. Instead, he was tasked full time with the very unsexy job of maintaining critical data infrastructure. In short, data science in practice turned out to be largely about the care and feeding of the data pipeline. Jimmy Lin and Dmitriy Ryaboy from Twitter termed this overhead “the plumbing”: the codebase and data transformations necessary to keep heterogeneous parts of an ML infrastructure running seamlessly.[2]

An old rule of thumb is that 80 percent of the work in any data-science project is just cleaning your data. Data scientists from Google are less optimistic, projecting that a mature system might be composed, at best, of 5 percent ML code; the rest is what they call “glue code,” the supporting code written expressly to get data in and out of general-purpose ML packages.[3] Add to this code-based baling wire and duct tape all of the other supporting services, tools, and infrastructure necessary to keep insights flowing smoothly from data, and the proportion of ML code compared to its surrounding infrastructure can become vanishingly small (Figure 2).

Figure 2. Proportion of ML code compared to the surrounding infrastructure[3]

All of this infrastructure and code just to extract insight from data? What gives?

Partly, this bloat is the result of technical debt incurred early on: easy data science wins and ad hoc investigations that struck pay dirt and were put into production with little thought given to long-term maintainability. Partly, too, this is what Rogati might call data illiteracy: an organization’s misunderstanding of the role its data plays; that is, treating data as an add-on rather than as a fundamental part of business processes.[4]

All of this discussion brings us back to asking the wrong questions because of the hype-of-the-month. “How do we incorporate AI into our business processes?” is the right question for only a handful of organizations. “How do we build our processes to incorporate data?” is probably a better one for everyone else.

The need to think of data as a strategic business plan extends to IT. Samson Hu describes why this is important in “Building Analytics at 500px.” When he initially joined photo-sharing/stock-photography startup 500px, all relevant data was split between user-activity log information residing in Splunk and photo- and user-state data living in MySQL.[5] Due to this arrangement, an analyst had to write custom code to pull any data—and Hu was the only analyst at the company. The solution Hu eventually engineered was a SQL data warehouse housed on Amazon Redshift, with a front end that enabled Hu’s colleagues at 500px to pull data on their own.

You’re Going to Need a Data Warehouse … Eventually

Chang related, in his account of his career’s growing pains, that there is a direct correlation between the maturity of an organization’s data infrastructure and the insights that data scientists can draw from that data.[6] Part of this maturity is having a central location where different kinds of data (such as log, sensor, or database data) can be housed after being extracted, transformed, and loaded into forms that can be used by analysts or ML code.
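To make “extracted, transformed, and loaded” a little more concrete, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module as a stand-in for a central store. The log format, the table names, and the `extract_transform` and `load` helpers are assumptions invented for this example, not a description of any particular company’s pipeline.

```python
import sqlite3

# Hypothetical sources: raw log lines and rows from a production database.
log_lines = ["2018-01-05 click user=7", "2018-01-05 view user=9", "bad line"]
user_rows = [(7, "alice"), (9, "bob")]

def extract_transform(lines):
    """Parse log lines into (day, event, user_id) tuples, skipping bad ones."""
    parsed = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].startswith("user="):
            continue  # a real pipeline would log or quarantine these lines
        parsed.append((parts[0], parts[1], int(parts[2][len("user="):])))
    return parsed

def load(conn, events, users):
    """Load both sources into one central store."""
    conn.execute("CREATE TABLE events (day TEXT, event TEXT, user_id INTEGER)")
    conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, extract_transform(log_lines), user_rows)

# Once both kinds of data live in one place, a single query can combine them.
rows = conn.execute(
    "SELECT u.name, e.event FROM events e JOIN users u USING (user_id) "
    "ORDER BY u.name"
).fetchall()
print(rows)
```

The point of the sketch is the last query: once log data and database data sit side by side, an analyst no longer needs custom code for every question.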

Remember that a data warehouse is less about the size of the data stored within (although data warehouses can grow quite large), and more about the purpose of that data. A data warehouse is ultimately just a database that differs from your other production databases in that you want to hide complexity (hence the need to extract and transform data before putting it in there). It is also typically a denormalized database, because read performance matters much more than write performance.
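A small sketch may help show what denormalizing for read performance means in practice. The example below again uses Python’s built-in `sqlite3` module as a stand-in for a warehouse. The `users`, `photos`, and `photo_facts` tables are hypothetical names chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized, production-style tables (hypothetical schema).
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, user_id INTEGER, views INTEGER);
    INSERT INTO users  VALUES (1, 'CA'), (2, 'US');
    INSERT INTO photos VALUES (10, 1, 5), (11, 1, 7), (12, 2, 3);

    -- Denormalized copy: the join is paid once at load time, not per query.
    CREATE TABLE photo_facts AS
        SELECT p.photo_id, p.views, u.user_id, u.country
        FROM photos p JOIN users u ON u.user_id = p.user_id;
""")

# Analysts read from the wide table without needing to know the join logic.
views_by_country = conn.execute(
    "SELECT country, SUM(views) FROM photo_facts "
    "GROUP BY country ORDER BY country"
).fetchall()
print(views_by_country)
```

The trade-off is that `country` is now stored once per photo rather than once per user, so writes and updates cost more. That is usually acceptable in a warehouse, where data is loaded in batches and read far more often than it is changed.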

A lot of the technical debt that can creep into ML and AI-driven analytics initiatives stems from trying to race to the end. But the journey to advanced analytics, like any other bit of IT engineering, is an incremental one that requires solid processes and best practices built into each step. An organization of any size can start down this path, because every organization can, and ultimately must, keep practicing good fundamentals to arrive at better analytics.


[1] Rogati, Monica. “The AI Hierarchy of Needs.” Hackernoon. August 2017. https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007.

[2] Lin, Jimmy, and Dmitriy Ryaboy. “Scaling Big Data Mining Infrastructure: The Twitter Experience.” SIGKDD Explorations. 2013. www.datascienceassn.org/sites/default/files/Scaling%20Big%20Data%20Mining%20Infrastructure%20-%20The%20Twitter%20Experience.pdf.

[3] Sculley et al. “Hidden Technical Debt in Machine Learning Systems.” Proceedings of the 28th Neural Information Processing Systems conference. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf.

[4] Rogati, Monica. “Data Literacy, Product Design and the Many-Faced God.” October 2016. https://artplusmarketing.com/data-literacy-product-design-and-the-many-faced-god-cf8339e035a.

[5] Hu, Samson. “Building Analytics at 500px.” June 2015. https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83.

[6] Chang, Robert. “A Beginner’s Guide to Data Engineering—Part 1.” January 2018. https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7.
