Causata

Causata Blog

Integrating web data: the first step

Thursday, 16 December 2010

A hot topic at the moment is connecting web behavioral data to “offline” data: everything else that you know about your customers, stored in legacy and backend systems. Clients tell us every day that using data from the web to inform and improve offline decisions is now a priority.

Many organizations have developed their customer analytics in silos, with separate data infrastructure in different parts of the business. It’s common for the dotcom group in a company to be so detached from the traditional offline part of the business that no data is shared at all. Not only are these companies failing to harness the power of web data outside of the dotcom group, but they also don’t have a good idea of which customers are doing what online.

This can be the case even if they’ve got a well-managed CRM system and a sophisticated web analytics team. Web analytics systems are designed to show aggregated information across web site visitors rather than to store information about individual customers, and they weren’t designed to be connected to other systems.

At Causata we believe that connecting web data is essential.

Combine web and offline data

With all the predictive power of web data there are a lot of cool things you can do, because web data gives you customer intent. Customers are literally telling you “I like this TV”, “I’m interested in accessories”, “I’m in the market for a loan”, etc. You should be listening to this information and acting on it quickly or you’ll lose out on opportunities to help customers and grow your business. We’ve seen again and again that online behavior can provide much more predictive power than traditional customer segmentation when it comes to understanding key events such as product purchases and attrition.

The conversations we have with clients generally focus on how they can understand why their customers behave the way they do and which customer actions can be influenced. There are huge opportunities for gains from linking data. There is, however, a basic step that nearly everyone can and should take right away.

If you’re a company that takes web data seriously then you probably already have access to some useful data. While many Fortune 1000 companies are experimenting with Google Analytics, which is free, about 40% of the Fortune 1000 care enough about web data to purchase a paid web analytics tool such as Coremetrics, Omniture, or Webtrends.

If you’re using a paid web analytics tool you’ll probably have already marked up your web pages to capture key interaction information for reporting purposes. Most of these tools will also let you access months of historical web data in batch form. It’s possible to combine this with your offline data.

The first step for integrating web data is to connect this historical batch data from your web analytics tool to your offline data. The challenge is to find a common key that will identify customers so you can join the two data sets.

Ideally, you will already have a Customer ID or Transaction ID of some sort in your web markup that you can use as a key to link the web data to the offline data. Otherwise, you’ll want to consider a few small changes to the website; typically a single field added to a handful of web pages is enough to get started. The benefit is huge because once a user identifies themselves you can understand their browsing history alongside your offline data. This data matching process improves as you add more keys: things like a Cart ID on a checkout funnel or an email address when someone clicks through to your site from an email campaign (among others) can be used to merge offline and online data.
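To make the common-key idea concrete, here is a minimal sketch in Python. It assumes a batch export with one row per page view, where a Customer ID appears only on pages where the visitor identified themselves. The field names (visitor_id, customer_id, page, segment) and the sample rows are made up for illustration; a real export from your web analytics tool will look different.

```python
from collections import defaultdict

# Hypothetical batch export from a web analytics tool: one row per page view.
# customer_id is only present once the visitor has identified themselves.
web_rows = [
    {"visitor_id": "v1", "customer_id": "C100", "page": "/loans"},
    {"visitor_id": "v1", "customer_id": None, "page": "/loans/apply"},
    {"visitor_id": "v2", "customer_id": None, "page": "/tv"},
    {"visitor_id": "v3", "customer_id": "C200", "page": "/accessories"},
]

# Hypothetical offline records keyed by the same Customer ID.
offline = {
    "C100": {"segment": "mortgage holder"},
    "C200": {"segment": "new customer"},
}


def link_web_to_offline(web_rows, offline):
    """Attach each identified visitor's browsing history to an offline record."""
    # Step 1: learn which visitor maps to which customer from any row
    # where the key is present.
    visitor_to_customer = {}
    for row in web_rows:
        if row["customer_id"] is not None:
            visitor_to_customer[row["visitor_id"]] = row["customer_id"]

    # Step 2: attribute every page view by that visitor, including the
    # anonymous ones, to the customer.
    pages = defaultdict(list)
    for row in web_rows:
        customer = visitor_to_customer.get(row["visitor_id"])
        if customer in offline:
            pages[customer].append(row["page"])

    # Step 3: join the browsing history with the offline attributes.
    return {c: {**offline[c], "pages": pages.get(c, [])} for c in offline}


linked = link_web_to_offline(web_rows, offline)
for customer_id, record in linked.items():
    print(customer_id, record)
```

Visitor v2 never identifies themselves, so their page view stays unlinked. That is exactly the gap that extra keys such as a Cart ID or an email address help to close.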

Once you have the online data connected it can be very enlightening. Suddenly it’s possible to see how people behave online, how online behavior differs according to different customer segments, and how web site activity relates to other offline activities in a store, a call center or anywhere else.

Chart of predictive factors for loan application

Of course, connecting the batch history from your web analytics tool to offline data isn’t the be-all-and-end-all – with the web you really want to be working in real-time – but it’s a great first step that everybody should be taking… now.

Gareth

Learn more about Causata Technology

More prediction, less reporting. Freeing up an analyst’s time so they can make a greater impact

Why do so many organizations struggle to wring value from their data? For most the problem is not data quantity – it’s relevance and connectedness. Much of the data scattered across enterprises is practically worthless until it’s processed and connected in one place. Indeed, expecting valuable insights to spring from your data is like expecting a pile of coal to turn into diamonds. Raw ingredients have little value without the right conditions.

Irrelevant, unmerged data is so common that data scientists, analysts, and modelers spend much of their time trying to squeeze it into more valuable forms. Among the people I’ve asked it’s not a stretch to say they spend over 90% of their time on this. They’ll also tell you they’d prefer to be spending their time building predictive models because they know that’s where they can create most business value.

This picture gets worse when you start to tackle web data. Web data is notoriously sparse. Real data lacks the pretty normal distributions people see in textbooks and this is only amplified on the web. Consider the example below showing page views: it may seem odd but most customers have none because not everyone is using the web yet, and the most active online customers with two or more page views are a small fraction of the overall population. Are these active online customers making more purchases across all channels? You can’t answer this question without connecting your data at the customer level across channels.

Chart of visitor count in the last 90 days by number of page views for each visitor
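A distribution like this can only be built once web data is tied to the full customer list, because customers with zero page views never appear in the web data on its own. The sketch below shows the idea with invented customer IDs and page views; the function name is hypothetical.

```python
from collections import Counter

# Hypothetical inputs: every known customer (from offline systems) and
# one entry per page view in the last 90 days (from web data).
all_customers = ["C1", "C2", "C3", "C4", "C5", "C6"]
page_views = ["C2", "C5", "C5", "C5"]


def page_view_histogram(all_customers, page_views):
    """Count how many customers had 0, 1, 2, ... page views."""
    views_per_customer = Counter(page_views)
    # A Counter returns 0 for missing keys, so customers who never
    # visited the site land in the zero bucket.
    return Counter(views_per_customer[c] for c in all_customers)


histogram = page_view_histogram(all_customers, page_views)
for views in sorted(histogram):
    print(f"{views} page views: {histogram[views]} customers")
```

In this toy data four of the six customers fall into the zero bucket, which is the kind of sparsity described above.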

Collecting the web data in the first place, and then structuring it into a dataset that can be used in a statistical tool like SAS or R, can take a lot of time. The world would be a much better place if some of this effort were done automatically for a data analyst. Causata was designed with this in mind.

Causata automates the collection and connection of real-time web data so that it is structured around customers and can feed predictive models. Rich markup that characterizes customers is placed on web pages. This web data is combined with offline data and stored where it can be accessed easily using a unique identifier for each customer. Causata provides a large set of predefined variables suitable for customer analysis and modeling. More variables can be added as needed. These variables are constructed at query time from raw events so there is enormous flexibility for an analyst in terms of adding or changing variables at any time.
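As a rough illustration of what constructing variables at query time from raw events can mean, here is a small Python sketch. It is not Causata's implementation or API; the event tuples, the variable names and the helper functions are all assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical raw event log: (customer_id, event_type, event_date).
events = [
    ("C1", "page_view", date(2010, 12, 1)),
    ("C1", "page_view", date(2010, 12, 10)),
    ("C1", "purchase", date(2010, 11, 20)),
    ("C2", "page_view", date(2010, 8, 1)),
]


def count_events(events, customer_id, event_type, as_of, window_days):
    """Count a customer's events of one type in the window ending at as_of."""
    start = as_of - timedelta(days=window_days)
    return sum(
        1
        for cid, etype, when in events
        if cid == customer_id and etype == event_type and start < when <= as_of
    )


# Variables are just named functions over the raw events, so adding or
# changing one does not require reprocessing any stored data.
VARIABLES = {
    "page_views_last_30_days": lambda ev, cid, as_of: count_events(
        ev, cid, "page_view", as_of, 30
    ),
    "purchases_last_90_days": lambda ev, cid, as_of: count_events(
        ev, cid, "purchase", as_of, 90
    ),
}


def customer_row(events, customer_id, as_of):
    """Evaluate every variable for one customer at query time."""
    return {name: fn(events, customer_id, as_of) for name, fn in VARIABLES.items()}


row = customer_row(events, "C1", date(2010, 12, 16))
print(row)
```

Because nothing is precomputed, a new variable is one more entry in the dictionary, and the same raw events can be re-read with a different window or a different as-of date.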

Causata displays the data graphically allowing an analyst to easily understand the distributions of the variables across any set of customers and also which variables have the most predictive power for a particular outcome. From there a dataset suitable for modeling can be exported to SAS/R. This means datasets for modeling can be created in minutes rather than weeks.

A huge portion of the end-to-end effort of creating value from customer data has been automated in Causata. If you’re an analyst, suddenly you can spend much more of your time on what you enjoy. You can actually explore the statistical models you want, put them into production and give your results a big boost.

Companies need to shift the analytical emphasis from reporting to operationalizing the power of predictive models by embracing opportunities to allow data analysts to spend more time on higher value tasks. There are so many places in a business that stand to benefit by being a lot smarter with data - marketing, personalizing an online experience and service are just a few examples.


Gareth


Simpson's Paradox meets marketing

Wednesday, 1 December 2010

Part of Causata's name is drawn from the word 'causal'.  In delving into masses of customer data Causata's goal is to surface causal effects.  To do that Causata keeps incredibly granular data about events (any customer interaction) along a timeline.  By examining events across time and controlling which people are exposed to particular interactions (such as who sees certain content on a web site), Causata teases out causal effects.

One of the surprising things when dealing with the data is how counterintuitive some results can be.  A great example of this is Simpson's Paradox (it's not strictly a paradox, just very unexpected).

Imagine a situation where two pieces of advertising (called Version A and Version B) have been prepared for use in print and on the web.  The advertisements are placed and the percentage of people responding to the ads is measured.  From this data a simple table can be drawn up showing the number of responses, the number of impressions and the response rate for each of the versions and media:

            Version A                Version B
Web         1800/130000 (1.38%)      500/40000 (1.25%)
Print       750/40000 (1.88%)        2200/130000 (1.69%)

It's clear that Version A outperforms Version B in both print and on the web. From that it would be obvious to conclude that Version A 'works better' than Version B. But there's a problem.

When the results are combined a different picture emerges:

              Version A                Version B
Web + Print   2550/170000 (1.50%)      2700/170000 (1.59%)

Here it's clear that Version B produces a better response rate. That is Simpson's Paradox in action.
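The arithmetic is easy to check. The short Python sketch below recomputes the per-medium and combined response rates from the response and impression counts quoted above; the dictionary layout and function names are just one convenient way to arrange the numbers.

```python
# (responses, impressions) for each version and medium, as quoted above.
results = {
    "A": {"web": (1800, 130000), "print": (750, 40000)},
    "B": {"web": (500, 40000), "print": (2200, 130000)},
}


def rate(responses, impressions):
    """Response rate as a fraction of impressions."""
    return responses / impressions


def combined(version):
    """Total responses and impressions for a version across both media."""
    responses = sum(r for r, _ in results[version].values())
    impressions = sum(i for _, i in results[version].values())
    return responses, impressions


for medium in ("web", "print"):
    a = rate(*results["A"][medium])
    b = rate(*results["B"][medium])
    print(f"{medium}: A {a:.2%} vs B {b:.2%}")

a_all = rate(*combined("A"))
b_all = rate(*combined("B"))
print(f"web + print: A {a_all:.2%} vs B {b_all:.2%}")
```

Version A wins each per-medium comparison, yet Version B wins once the counts are pooled.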

In this case the reversal occurred for two reasons: there's a large difference in the number of impressions each advertisement received in each medium (perhaps because of different sized print runs for a catalog or a difference in targeting on the web) and there's a confounding variable (a variable that needs to be taken into account when interpreting the results).

Here the confounding variable is the medium (web or print). Notice how for both versions of the advertisement print is more successful. For whatever reason people seeing the ad in print respond better to it than those who see it on the web. The choice of medium and the disparity in the number of impressions together mean that the wrong conclusion can be drawn if data is mixed.
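One way to see the effect of the media mix is to work out what share of each version's impressions were in print, and then compare the two versions under the same mix. The sketch below does this with a technique usually called direct standardization. The 50/50 mix is an arbitrary assumption chosen for illustration, and the function names are hypothetical.

```python
# (responses, impressions) for each version and medium, from the example.
results = {
    "A": {"web": (1800, 130000), "print": (750, 40000)},
    "B": {"web": (500, 40000), "print": (2200, 130000)},
}


def print_share(version):
    """Fraction of a version's impressions that were in print."""
    web_impressions = results[version]["web"][1]
    print_impressions = results[version]["print"][1]
    return print_impressions / (web_impressions + print_impressions)


def standardized_rate(version, mix):
    """Response rate the version would have under a given media mix."""
    return sum(
        share * results[version][medium][0] / results[version][medium][1]
        for medium, share in mix.items()
    )


# Assumed mix, for illustration only: half web, half print for both versions.
equal_mix = {"web": 0.5, "print": 0.5}

for version in ("A", "B"):
    print(
        f"Version {version}: {print_share(version):.1%} of impressions in print, "
        f"standardized rate {standardized_rate(version, equal_mix):.2%}"
    )
```

Most of Version B's impressions were in print, the stronger medium, while most of Version A's were on the web. Under an equal mix, Version A comes out ahead again.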

Simpson's Paradox occurs because of an incorrect interpretation of correlation and causality. When looking at the combined data above it's tempting to say that Version B works better than Version A. There's a leap from the numbers to the conclusion, and because the confounding variable is ignored in the combined numbers its effect is also ignored.

Of course, the opposite is true. Version A is better than Version B. Once the true causal relationship is examined (the effect on response rate of versions A and B for web and separately for print) the true answer is revealed.

If you want to dig deep into this read Simpson's Paradox: An Anatomy.

At Causata we are mindful of Simpson's Paradox and its effect on conclusions drawn from data and are careful that our system is not itself fooled by this simple but tricky problem.

John.
