Summary: This is written for executives or managers who want to make sense of their corporate data and use it to help their organization going forward.

How do you start down the path of using data to actually help you -- maybe using tools like “big data,” AI, or machine learning? And what is the associated overhead that you need to put up with in terms of consultants, PMPs, new employees, databases, data systems, servers, and, of course, the cloud? 

These articles will talk about the challenges involved -- and what needs doing in your organization. 

Some of how you use your data will depend on what questions you want to answer -- or what decisions you want help making -- but some of it boils down to a standard recipe:

Data → Analytics → Results

Like any recipe, however, if you skip steps and cut corners, what could be awesome becomes a toxic horror.

At the same time, it’s possible to make a totally decent meal without a Michelin-Star chef and caviar. So where does that leave you? How much “kitchen equipment” and expertise do you need before you can start taking advantage of your data?

I’ll try to demystify some of this. Read ahead with a sense of humor and forgiveness for yourself and your staff! I’ll summarize my tips at the end of each section with a ☝.

What’s your objective?

The old saying “Pick two: Quick, Cheap, and Good” completely applies. Having some help formulating your questions can be enormously useful. We consultants are only human -- we like perfection. If given enough time and budget, we’ll try to create it. However, are we solving the right problem? Without bleeding you dry? Let me elaborate with an example:

Manager:  Find me all my customers who owe me money.

Consultant: OK, should we try to resurrect your old data system to try to pull information out of there too? For the right price and six months, we can!

Manager: Fine, something easier -- just list all my current customers, and tell me when they were last active on my system.

Consultant: Hmmm. Who’s your customer? Is it The World Anti-Poverty Centre? Is it the NGO that they’ve funded to buy your product? Is it the remote village that’s using your device? Is it the doctor in the remote village who specifically uses the device? Is it his assistant who has re-ordered recently? And what determines if a customer is current?

Manager: OK, I get this. Just give me the major organizations, how much they owe us, and the person to contact for clients who owe us money since January…

Consultant: Well, at the World Anti-Poverty Centre, the president is Mr. Chen Li (or maybe that’s Mr. Li Chen?), but we were really involved with Dr. Giorgia Tozzi in their sustainable food group… However, the NGO received the grant from them, so they’re really your customer?

☝ You’re in business because what you do is not trivially easy. Even answering “basic” questions about your business probably won’t be simple. It’s really important to share your overall goals with your analysts. Your data analyst or consultant can help you refine your questions and how you ask them to save time, hassle, and money -- and get results that are timely and important to your work.

So where is your data?

Data lives and grows where folks currently use it. It takes time and effort to centralize it. Every company I’ve worked with has their data in a million places. Maybe your organization has data:

  1. In Excel spreadsheets with the finance team
  2. On the sales guy’s notepad or laptop
  3. With your manufacturing team in Shenzhen in their database (and it’s in Chinese!)
  4. Locked away in an enterprise system like SAP, Oracle, PeopleSoft, or Salesforce
  5. Trapped in HR’s confidential server
  6. On SurveyMonkey’s website (in somebody’s personal account)
  7. In your knowledge management system (KMS)
  8. In your training system
  9. In your employees’ email (“just waiting for ___ before I can send you a contract”)
  10. In the Word docs that your interns compile every quarter

In most companies, none of these systems communicate with each other. Plus...

If humans enter the data, it’s frequently duplicated, missing, fragmented, outdated, and non-standard. 

If machines enter the data, there’s usually a ton of it, it’s often not-unique, and it’s probably cryptic. 

You’re not alone if you’re concerned that you have data problems!

☝ Step one is to make a list: see what data you have, and where it is. This is a bit of archaeology -- it’s probably going to be ugly -- like going through your stack of important papers or unread email. When you finally do it, you discover that you forgot to write back to so-and-so, add that business card to your contacts, use that gift certificate, etc.

One way to make this list is to have somebody walk through your business and talk to your employees, “What do you do? What do you use to do that? Where do you record what’s been done? Is this the same for every type of customer?” Sometimes an IT department may have a partial list of data sources. That’s a great place to start -- but walking around gets you much further!

Step two is to start centralizing your data, or making sure that your data systems can communicate with each other. Usually this means a database and writing software to mesh everything together. Once your data is all in one place, it’s amazing the sorts of information you can quickly learn. (Sometimes you just learn that you can’t trust any of it!)

What is in your data?

Your data is probably a mixture of numbers and free-form text. How good is it? If your company is like any other company I’ve worked with, even listing your clients will show you that there’s work to do. 

Duplicated data (with old/wrong information)

Let’s ask a sales operations person who’s in their report for JP Morgan Chase Bank (formerly Chase Manhattan, then Chase Bank, then JP Morgan Chase…). We find:

  1. Chase Bank
  2. JP Morgan Chase
  3. JP Morgan Chase Bank
  4. JP Morgan (10131) if somebody includes your customer ID number
  5. CHASE BANK
  6. chase
  7. james (chase)
  8. Chase (lead)
  9. Chase/Bear Stearns (old)

Your employee says, “I know all the right information is in the 4th one, the rest are outdated.”

In addition, I bet your communications or marketing team has their list of customers -- “Sue Smith, VP of Development at JPM/Chase” And your service group does too: “Ramesh at loading dock 4 (in the back, between 8 AM - 11:30 AM except on Saturdays).”

☝ Computers are good at doing the same thing over and over. However, they can’t decide what’s worth their time and what’s useless. Before you start trying to use your data, your team needs to remove duplicate records or they will throw off your results. And going forward, your team needs to be diligent about maintaining the integrity of your new data.
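To make the cleanup concrete, here is a minimal Python sketch of how the name variants above might be grouped. It is an illustration under assumptions, not a finished tool: the alias table, the function names, and the choice of “JP Morgan Chase Bank” as the canonical name are all hypothetical, and a person who knows the account still has to build that table and review whatever doesn’t match.

```python
import re
from collections import defaultdict

# Hypothetical alias table: someone who knows the account maps each
# cleaned-up spelling to one canonical customer name.
ALIASES = {
    "chase bank": "JP Morgan Chase Bank",
    "jp morgan chase": "JP Morgan Chase Bank",
    "jp morgan chase bank": "JP Morgan Chase Bank",
    "jp morgan": "JP Morgan Chase Bank",
    "chase": "JP Morgan Chase Bank",
}


def clean(name: str) -> str:
    """Drop parenthesised notes, collapse whitespace, and lowercase."""
    name = re.sub(r"\(.*?\)", " ", name)  # removes "(10131)", "(lead)", ...
    return re.sub(r"\s+", " ", name).strip().lower()


def group_records(names):
    """Return (variants grouped by canonical name, names needing human review)."""
    groups = defaultdict(list)
    unmatched = []
    for raw in names:
        canonical = ALIASES.get(clean(raw))
        if canonical is None:
            unmatched.append(raw)  # the computer can't decide; a person must
        else:
            groups[canonical].append(raw)
    return dict(groups), unmatched


records = [
    "Chase Bank", "JP Morgan Chase", "JP Morgan Chase Bank",
    "JP Morgan (10131)", "CHASE BANK", "chase", "james (chase)",
    "Chase (lead)", "Chase/Bear Stearns (old)",
]
groups, unmatched = group_records(records)
print(len(groups["JP Morgan Chase Bank"]), "variants merged")
print("Needs review:", unmatched)
```

Notice that the script only handles the easy cases. Entries like “james (chase)” fall into the review pile, which is exactly where your employee’s knowledge of the account comes in.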

Fragmented Data

I worked with a company that wanted to automate reports on their employee training records. Instead of querying “the system,” you found that their training department had set up one “system” (a “filing cabinet” containing training records) for each group in the company. So in order to get everyone’s records, you had to make separate reports for each group and then combine them. Making this process more complex, some employees were trained with an old system, so to get their reports, the training department needed IT to boot up those old computers and pull those reports. Then combine them with the new reports…

I guarantee you that this wasn’t done frequently. And when it was done, people made all sorts of mistakes.

☝ If you want a complete vision of some or all of your organization, you need to make sure that all the data is in the same place -- or that the relevant systems communicate with each other electronically.

Non-Standard Data

Another client of mine was an IoT (Internet of Things) vendor that sold vehicle tracking devices. Interested in tracking vehicle performance, they asked their customers for details on their vehicles. They stored the responses in their database:

  1. 6.3245 m
  2. seis metros
  3. ছয় মিটার
  4. 134
  5. 388 m
  6. Blue, 25’

Can you imagine trying to do fuel calculations on “blue” vs. “134”? Needless to say, we ended up throwing out all this “data” and starting over with “Please enter the length of your vehicle in meters:” And if we received a length that was bigger than a football field, we triggered an alert to follow up with the client…

☝ This is a great example of the change management or re-engineering of the process that usually needs to happen when setting out to work with data. Computers are great with numbers, or even categories (e.g. “good”, “bad”, “ugly”), but mixed data and numbers or free-text fields tend to be difficult to work with.
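The kind of check we added can be sketched in a few lines of Python. This is an illustration only: the function name, the status labels, and the 100-meter threshold (roughly a football field) are assumptions, not the client’s actual code.

```python
import math

MAX_PLAUSIBLE_METERS = 100.0  # assumed threshold, roughly a football field


def parse_length_m(text: str):
    """Return (length_in_meters, status) for one form response.

    status is "ok", "follow_up" (suspiciously long), or "rejected"
    (not a plain positive number).
    """
    try:
        value = float(text.strip())
    except ValueError:
        return None, "rejected"
    if not math.isfinite(value) or value <= 0:
        return None, "rejected"
    if value > MAX_PLAUSIBLE_METERS:
        return value, "follow_up"  # trigger an alert to contact the client
    return value, "ok"


for answer in ["12.5", "134", "388 m", "seis metros", "Blue, 25’"]:
    print(repr(answer), "->", parse_length_m(answer))
```

The design choice is deliberately strict: the form asks for meters, so anything that is not a plain number is rejected at entry time rather than guessed at later.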

Out-Of-Date Data

Is your data accurate and up-to-date? For a doctor-locator feature on a website I worked on, our team was frustrated by our sales group writing down incomplete information and never bothering to update it. Does Dr. Smith practice at more than one hospital? Does she have a private office? What’s the phone number? Is that Hotmail email address still valid? Is her name Jenny Smith, or is she Dr. Jennifer M. Smith? Are the office hours still correct? These details are excruciating to track, but may be vital to your efforts.

☝ If your analytics rely on contact information or other information being up-to-date, you need to make sure a process gets put in place to double-check and correct your data on a regular basis. 
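One simple version of such a process is a regular report of records nobody has verified recently. The Python sketch below assumes each record carries a hypothetical last_verified date and uses an arbitrary 180-day window; both the field name and the window are illustrative choices.

```python
from datetime import date, timedelta


def stale_records(records, today, max_age_days=180):
    """Return names of records never verified or verified too long ago."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        r["name"]
        for r in records
        if r["last_verified"] is None or r["last_verified"] < cutoff
    ]


# Made-up sample records for illustration.
doctors = [
    {"name": "Dr. Jennifer M. Smith", "last_verified": date(2023, 5, 10)},
    {"name": "Dr. A. Jones", "last_verified": date(2024, 6, 1)},
    {"name": "Dr. B. Lee", "last_verified": None},
]
to_recheck = stale_records(doctors, today=date(2024, 7, 1))
print("Please re-verify:", to_recheck)
```

The list that comes out is a to-do list for a person; somebody still has to pick up the phone and confirm the details.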

Missing Data

If I don’t talk to my wife during the day, she assumes that I’m going to get groceries and make dinner. That’s fine, unless I assume that she’s going to get groceries and make dinner… In the corporate world, people don’t fill in information for any number of reasons -- the site was down, a form was annoying, they forgot… But if your histogram of customer data has a big bucket of “Other/Not-Recorded” it makes the analysis less useful. Are you sure you know your customers?

☝ Making sure to visualize your data is a great way to double-check your data. You can have a great trend-line analysis, but how much of your data is missing? Make sure to visualize what percentage of your data is missing or questionable too.
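Before drawing any chart, it can help to put a number on the gap. Here is a minimal Python sketch that reports what share of a field is missing; the list of “missing” markers and the sample values are assumptions, so adjust them to whatever your own systems use for blanks.

```python
# Assumed spellings that count as "missing" in this illustration.
MISSING_MARKERS = {"", "other", "not recorded", "other/not-recorded", "n/a"}


def missing_share(values):
    """Return the percentage of values that are blank or placeholder entries."""
    if not values:
        return 0.0
    missing = sum(
        1
        for v in values
        if v is None or str(v).strip().lower() in MISSING_MARKERS
    )
    return 100.0 * missing / len(values)


# Made-up customer segment field for illustration.
segments = ["Retail", "", "Other/Not-Recorded", "Wholesale",
            None, "Retail", "n/a", "Retail"]
print(f"{missing_share(segments):.0f}% of customer segments are missing")
```

If that percentage is large, show it right next to the trend line so nobody mistakes a partial picture for the whole one.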

Non-Unique Data

Finally, I have to mention a common problem: how do you uniquely identify people?

  1. First and last name? How many David Lee’s can you have? Or is it “Dave”?
  2. Email address? What if they change company/group/maiden name?
  3. Phone number? What if they change numbers?
  4. Government ID? What if you want to operate in a country that doesn’t have this ID?

☝ I think it’s important to go through the various types of data with a data expert -- not Bob from your team who taught himself to use spreadsheets and Microsoft Access and to make a website. That’s like asking your brother to fix your car. It might be good. It might happen within a year. Or you could end up with something that kinda works, but you just don’t totally trust it.

We data experts might not have all the answers, but we’ve probably seen something similar before. And, seriously, Bob?

“Well, we have a system in place...”

Especially if you have a fancy system like SAP, Salesforce, Oracle, or NetSuite, I’d suggest migrating your data to some other “analytics” database. Now’s a great chance to see what you really have by moving the data and thoroughly inspecting it as you go. My guess is you have many of the same problems as I’ve listed above -- except they’re hidden by an inscrutable expert system or a gorgeous webpage. Unfortunately, this is often a lot of work and expensive. Depending on what you need, you can try to fake this by exporting reports. To be honest, I’ve had about 50% success with doing this. Sometimes it’s good enough. Other times you need to open the lid on your system.

Do I have Big Data or do I need to think about it?

“Big Data” can be defined as the 3 V’s -- Velocity, Variety, and Volume. Think about Amazon.com. Data arrives from their websites, service centers, tracking of parcels, delivery trucks, suppliers, publishers, etc. at an enormous rate all day long. The variety of data is astounding -- it’s not like a finance application where every stock has a ticker symbol (AMZN), a bid price, and an offer price. Some of it is “this customer just viewed this item,” some of it is “that customer wants to cancel their Prime subscription,” some of it is “this order was just delivered.” And finally, with all their transactions, the data is simply too big for a standard database system.

If you have a big data problem, you probably know it. Otherwise, your data is probably small or medium-sized. Even clients with thousands of devices, each reporting in once a second, often have medium-sized data. You may still have issues dealing with the variety of messages that are important to you, or the speed at which they arrive -- but these are solvable problems without jumping to Hadoop, BigQuery, or other big data tools.

And if you do have big data problems -- fortunately there are tools available to help you.

☝ Having hundreds of thousands, or even tens of millions, of records doesn’t mean you have big data problems. But many machine learning tools usually work well only if you do have big data. These pattern-matching methods require lots of training data; otherwise they can be too specific -- failing to recognize the patterns you want them to, or seeing patterns where they aren’t there.

“That’s not a duck: It’s only 15/16ths of one!”

“Obviously a duck.”

What next?

Depending on the size of your data and organization, it often takes a few months to inventory and reorganize your data so you can start answering your questions. It will take somebody going through what you’re currently doing, finding where all the gaps in your processes are, and fixing them. At the same time, you’ll be moving and centralizing your data so it’s easy for a computer to query. Both of these tasks require people -- sometimes one person can handle all of this. Sometimes you'll have a team of IT folks dealing with the data, and others managing the project with more formal project management techniques. At the end of the day, the results are important -- so it's critical to constantly be checking and validating your work.