Data Fusion: An Intuitive Argument

Data fusion is the process by which two or more different respondent-level databases (such as a television audience measurement panel and a consumer survey) are matched together to form a single-source, respondent-level database containing information from all of the original databases.  A review of the literature on the subject (see our Data Fusion section) makes it apparent that this is a vastly complicated matter involving a great deal of mathematical machinery.

Regardless of the mathematical sophistication, the essence of the problem must be whether the basic information available permits the data fusion to succeed.  Data fusion involves statistical matching on the variables that the two databases have in common, and the quality of the project must surely depend on the effectiveness of those common variables.  Where there are no common variables, or where the existing ones are irrelevant to the subject, no amount of mathematical sophistication will help.
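To make the idea of statistical matching concrete, here is a minimal sketch in Python.  The record layout, the field names (age_group, sex, viewing_hours, uses_allergy_remedy) and the toy values are all hypothetical, and the rule of picking a random donor within the same age/sex cell is just one simple choice, not necessarily the procedure used in any real fusion.

```python
import random
from collections import defaultdict

# Hypothetical toy records; every field name and value is illustrative only.
tv_panel = [  # donors: carry television viewing
    {"id": "T1", "age_group": "18-34", "sex": "F", "viewing_hours": 12},
    {"id": "T2", "age_group": "18-34", "sex": "F", "viewing_hours": 20},
    {"id": "T3", "age_group": "35-54", "sex": "M", "viewing_hours": 15},
    {"id": "T4", "age_group": "55+", "sex": "F", "viewing_hours": 30},
]
survey = [  # recipients: carry product usage
    {"id": "S1", "age_group": "18-34", "sex": "F", "uses_allergy_remedy": True},
    {"id": "S2", "age_group": "35-54", "sex": "M", "uses_allergy_remedy": False},
    {"id": "S3", "age_group": "55+", "sex": "F", "uses_allergy_remedy": True},
    {"id": "S4", "age_group": "55+", "sex": "M", "uses_allergy_remedy": True},
]

COMMON = ("age_group", "sex")  # the variables both databases share


def fuse(recipients, donors, common=COMMON, seed=0):
    """Attach one randomly chosen donor from the same cell to each recipient."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for d in donors:
        cells[tuple(d[k] for k in common)].append(d)

    fused = []
    for r in recipients:
        pool = cells.get(tuple(r[k] for k in common))
        if not pool:
            continue  # no donor shares these common variables, so no match
        donor = rng.choice(pool)
        row = dict(r)
        row["donor_id"] = donor["id"]
        # Copy over the information that only the donor database holds.
        for key, value in donor.items():
            if key not in common and key != "id":
                row[key] = value
        fused.append(row)
    return fused


fused = fuse(survey, tv_panel)
for row in fused:
    print(row)
```

In this toy example the recipient S4 (a man aged 55+) has no donor in the same cell and is left unmatched, which is the small-scale version of the point above: the matching can only be as good as the common variables and the donors available within them.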

In most situations, we expect the databases to include, at a minimum, basic demographics such as age, sex, geography, socio-economic status and so on.  The question then is: Are these variables sufficient to guarantee a reasonable performance for data fusion?  Or does people's behavior depend on deeper dimensions such as psychographics?  Obviously, we can bring in many statistical tools (such as correlational analyses, discriminant analyses, logistic analyses, CHAID, etc.) to gauge the relations between these basic demographic variables and the outcome variables (such as television viewing and consumer product usage).
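As one small illustration of what "gauging the relation" can mean, the sketch below computes Cramér's V, a chi-square-based measure of association that runs from 0 (no association) to 1 (perfect association), for a cross-tabulation of a demographic variable against an outcome variable.  This measure is our choice for illustration and is not one of the tools named above; the counts are made up, and the function assumes every row and column total is non-zero.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Assumes all row and column totals are non-zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows are age groups, columns are light / heavy viewers.
age_by_viewing = [
    [80, 40],   # 18-34
    [60, 60],   # 35-54
    [30, 90],   # 55+
]
print(round(cramers_v(age_by_viewing), 3))
```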

In this article, we make a simple, intuitive and non-technical argument for the effectiveness of data fusion based on a couple of basic demographic variables.  The data come from the 2001 MARS OTC/DTC Pharmaceutical Study.  This is a consumer mail survey of 23,705 adults in the USA, with the survey content focusing on the usage of magazines, television and pharmaceutical products.  This study will be fused with the television audience measurement from the Nielsen Television Index.

FACT #1:  Television viewing is strongly correlated with age/sex

Once upon a time --- in fact, a long time ago --- television was regarded as a mass medium that delivered large numbers of viewers.  As more television options emerged, the television audience became fragmented.  The most powerful discriminator of television choice has been age/sex demographics.  Today, television programs are sold on the basis of guarantees of total audience size as well as of audience size within certain key age/sex demographic groups.  Nothing else has been observed to produce an equally powerful and consistent segmentation.

The chart below is a correspondence map of television program type by age/sex.  This statistical technique makes a graphical representation of the relationships among the variables in a two-dimensional graph.  The correspondence analysis has no semantic understanding of the nature of these variables.  Purely from the internal structure of the responses from the survey participants, it was able to deduce the age/sex structure of the population.

On this map, men are in the lower half and women are in the top half; younger people are on the left half and older people are on the right half.  The television program types fall into place where one would expect them to.  These are commonly known facts about television program preferences.


(Source: 2001 MARS OTC/DTC Pharmaceutical Study)
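For readers who want to see the mechanics behind such a map, here is a minimal sketch of simple correspondence analysis in plain Python.  The table of counts is hypothetical and is not MARS data; the program types and age/sex groups are illustrative labels only.  The first two axes are extracted by power iteration purely to keep the example free of external libraries; a real analysis would normally rely on a statistical package.

```python
import math

# Hypothetical counts, for illustration only (not MARS data).
rows = ["Sports", "Daytime drama", "Evening news", "Music videos"]
columns = ["Men 18-49", "Men 50+", "Women 18-49", "Women 50+"]
counts = [
    [60, 40, 20, 15],
    [10, 10, 45, 50],
    [30, 45, 30, 45],
    [40, 15, 40, 15],
]


def _matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def _top_eigen(a, iters=1000):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    v = [1.0 + 0.37 * i for i in range(len(a))]  # fixed, arbitrary start vector
    for _ in range(iters):
        w = _matvec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:
            return 0.0, [0.0] * len(a)
        v = [x / norm for x in w]
    return sum(x * y for x, y in zip(v, _matvec(a, v))), v


def correspondence_analysis(table, n_dims=2):
    n = float(sum(map(sum, table)))
    r = [sum(row) / n for row in table]        # row masses
    c = [sum(col) / n for col in zip(*table)]  # column masses
    # Standardized residuals from the independence model.
    s = [[(table[i][j] / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(len(c))] for i in range(len(r))]
    total_inertia = sum(x * x for row in s for x in row)
    # a = S^T S; its eigenvalues are the principal inertias.
    a = [[sum(s[i][j] * s[i][k] for i in range(len(r)))
          for k in range(len(c))] for j in range(len(c))]

    inertias = []
    row_coords = [[] for _ in r]
    col_coords = [[] for _ in c]
    for _ in range(n_dims):
        lam, v = _top_eigen(a)
        inertias.append(lam)
        sv = _matvec(s, v)  # equals singular value times left singular vector
        for i in range(len(r)):
            row_coords[i].append(sv[i] / math.sqrt(r[i]))
        for j in range(len(c)):
            col_coords[j].append(math.sqrt(max(lam, 0.0)) * v[j] / math.sqrt(c[j]))
        # Deflate so the next pass finds the next axis.
        a = [[a[j][k] - lam * v[j] * v[k] for k in range(len(c))]
             for j in range(len(c))]
    return inertias, total_inertia, row_coords, col_coords


inertias, total_inertia, row_coords, col_coords = correspondence_analysis(counts)
for label, xy in list(zip(rows, row_coords)) + list(zip(columns, col_coords)):
    print(f"{label:15s} {xy[0]:+.3f} {xy[1]:+.3f}")
```

The printed pairs are the coordinates one would plot on the map.  Nothing in the code knows what "men" or "sports" means; any separation of the groups comes from the counts alone.  The sign of each axis is arbitrary, so only the relative positions carry meaning.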

FACT #2:  Magazine readership is strongly associated with age/sex

Once upon a time --- in fact, a long time ago --- there were only a few general interest magazines whose main goal was to deliver the largest audiences possible.  In time, the magazine industry became highly fragmented, with tens of thousands of magazines targeted towards various niches.  Of course, there is a trade-off between niche size and efficacy --- while the content appeal is higher for a narrowly targeted niche (as evidenced by titles such as Arthritis Today or Diabetes Forecast), the total audience size may be limited.  Among the large consumer titles, the best single discriminator is still age/sex.

The chart below is a correspondence map between age/sex and the readership of about 100 different magazines.  Again, without any understanding of the semantic content, the correspondence analysis revealed the age/sex structure of the population through the internal data structure.  In this map, men are in the bottom half and women are in the top half; older people are on the left and younger people are on the right.  Please note that the notion of top-versus-bottom or right-versus-left has no fixed meaning, as we could have looked at the map upside down (so that the top and bottom roles are reversed) or from the back side (so that the left and right roles are reversed).  The positions of the magazines match what we know about them.


(Source: 2001 MARS OTC/DTC Pharmaceutical Study)

FACT #3:  Physical ailments are strongly associated with age/sex

Humans suffer from various kinds of ailments, but there are significant differences by physiology and lifestyle.  The chart below is a correspondence map between age/sex and various ailments.  Again, without any understanding of the semantic content of the data, the correspondence map reproduced the age/sex structure of the population.  In this map, men are in the bottom half and women are in the top half; older people are on the left and younger people are on the right.


(Source: 2001 MARS OTC/DTC Pharmaceutical Study)

Based purely on these three simple facts, all of which are well known and validated by any number of other studies, it would be reasonable to assert that statistical matching in data fusion has a reasonable chance of bringing together television viewing, magazine readership and pharmaceutical product usage.  While this is reassuring, it does perhaps lead to more questions, of which this is an obvious one: If age/sex is such a powerful predictor of television usage, magazine readership and pharmaceutical product usage, then why not use it as the planning variable without ever needing to resort to data fusion?  In fact, this is the most prevalent practice today when planning for television in the absence of information beyond the simple demographics.

Of course, for the sake of simplicity in our exposition, we have referred solely to age/sex here.  In data fusion, we will actually be able to leverage all the other common variables.  Among other things, one can expect to have demographic variables such as household income, geography, race, household composition, etc.  To the extent that these other variables can predict behavior beyond age/sex alone, data fusion would be more powerful than a simple age/sex classification.

Consider the case of income.  Countless studies have shown that television viewing and magazine readership are associated with income, in that rich people tend to watch less television and to read more, as well as to read different things.  There is no point in attempting to re-establish that here.  In the chart below, we show a correspondence map between annual household income and ailment conditions.  As you might expect, the income variable is unidimensional in nature, so that this correspondence map is really a straight line instead of a two-dimensional bi-plot.


(Source: 2001 MARS OTC/DTC Pharmaceutical Study)

(posted by Roland Soong, 3/17/2002)

