Caliper Matching in Data Fusion

This is a technical note about a special technique that is used in data fusion (see other articles for general discussion of data fusion). Data fusion is the process whereby two separate databases (such as a consumer survey and a television audience measurement panel) are integrated through the statistical matching of the respondents. The statistical matching is done on common variables (such as age and sex) that exist in both databases. Thus, a male teenager in one database is matched with a male teenager in the other database, and so on.

When the matching variables are categorical in nature (that is, variables for which the possible outcomes fall into mutually exclusive and exhaustive categories), we can measure the performance of the statistical matching by the number of instances in which people were matched with others in the same categories on the common variables. Ideally, one would like to find perfect matches all of the time. But when the number of matching variables is large, this is not always possible. This leads to considerations of how to balance the trade-offs of matching among the variables.

But that goes well beyond the subject of this article. Here, we simply want to discuss another class of variables that are not categorical in nature. Instead of a small number of mutually exclusive and exhaustive categories, the outcomes of these variables are continuous. The example that we will use is the amount of television viewing time per day. This can range anywhere from zero to 24 hours (assuming that the person does not sleep!). When it comes to statistical matching, this seems to require a different criterion for judging the quality of a match than simple success/failure. To be more precise, if one person watches 181 minutes, then we need not require the matching person to watch exactly 181 minutes too. It would seem that 182 minutes is quite acceptable, although a person with 684 minutes would seem too distant.

Let us consider this hypothetical example. We have a male who watches 181 minutes of television per day. There are two potential matching candidates: one is a male who watches 182 minutes of television per day, and the other is a female who watches 181 minutes of television per day. If the content of the fusion is gender-related, it would be clear that the male is a much better choice than the female. After all, the difference between male and female is much bigger than the difference between 181 and 182 minutes.

Operationally, this preference may be accomplished through the method of caliper matching. Here, the continuous variable (such as the number of television viewing minutes) is discretized into mutually exclusive and exhaustive categories. An example would be to convert the number of television viewing minutes into deciles (that is, highest 10%, next highest 10%, ... , lowest 10%).
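The discretization step can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure actually used for the fusion described in this article: the function name viewing_deciles is hypothetical, the cut points are computed within a single sample, and the decile labels run from 0 (lowest 10%) to 9 (highest 10%). Persons with identical viewing minutes always receive the same label, because labels are assigned by comparing each value against the nine cut points.

```python
import bisect
import statistics


def viewing_deciles(minutes):
    """Assign each person a decile label: 0 = lowest 10%, 9 = highest 10%.

    `minutes` is a list of daily viewing minutes (hypothetical input).
    """
    # Nine cut points that split the sample into ten equal-probability groups.
    cuts = statistics.quantiles(minutes, n=10)
    # The label is the number of cut points at or below the person's value.
    return [bisect.bisect_right(cuts, m) for m in minutes]


# Synthetic sample: 100 persons watching 1, 2, ..., 100 minutes per day.
sample = list(range(1, 101))
labels = viewing_deciles(sample)
```

With real data the groups will not be of exactly equal size when many persons share the same value (for example, zero viewing), since tied values are never split across deciles.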

This is illustrated by the following two graphs. In the graph on the left-hand side, we show the frequency distribution of the average daily television viewing hours taken from a sample of persons in the Nielsen Television Index people meter panel. The distribution ranges from zero to sixteen hours per day. We can group these people into deciles (that is, highest 10%, next highest 10%, ... , lowest 10%). Afterwards, we verify the grouping with the box-plot graph on the right-hand side. For each decile (numbered 0 to 9 on the horizontal axis), the box-plot shows the minimum value (= the line at the bottom), the 25th percentile (= the bottom of the box), the median (= the line in the middle of the box), the 75th percentile (= the top of the box) and the maximum value (= the line at the top). As expected, the box-plots ascend from left to right, with no overlaps between them.
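The same verification can be done numerically instead of graphically. The following sketch assumes that each person already carries a decile label from 0 to 9; the function names decile_summaries and deciles_are_separated are hypothetical. It computes the five numbers drawn in each box-plot and checks that no decile overlaps the next one.

```python
import statistics
from collections import defaultdict


def decile_summaries(values, labels):
    """Return {decile: (min, 25th pct, median, 75th pct, max)}.

    Each decile needs at least two persons for the quartiles to be defined.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    summaries = {}
    for label, group in groups.items():
        q1, median, q3 = statistics.quantiles(group, n=4)
        summaries[label] = (min(group), q1, median, q3, max(group))
    return summaries


def deciles_are_separated(summaries):
    """True if every decile's maximum is <= the next decile's minimum."""
    ordered = [summaries[k] for k in sorted(summaries)]
    return all(lo[4] <= hi[0] for lo, hi in zip(ordered, ordered[1:]))


# Synthetic data: 100 persons, ten per decile.
values = list(range(1, 101))
labels = [(v - 1) // 10 for v in values]
summaries = decile_summaries(values, labels)
separated = deciles_are_separated(summaries)
```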

Having converted the continuous variable into a categorical variable, the statistical matching can proceed as usual. The results are illustrated in the following chart for the data fusion between the NTI people meter panel and the MARS OTC/DTC survey. The statistical matching was done on television viewing deciles and a list of ten other demographic variables simultaneously. An NTI person from any television viewing decile is matched against one or more MARS persons from the corresponding television viewing decile. If more than one person is available, then consideration is given to matching on the demographic variables. When the data fusion is completed, a respondent-level database is constructed, consisting of records with both NTI and MARS data.
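To make the idea concrete, here is a minimal sketch of matching within deciles. It is not the algorithm used for the NTI/MARS fusion, whose details (weights, constraints, limits on donor reuse) are not given here. It assumes a simple greedy rule: each recipient is matched to the donor in the same decile that agrees on the largest number of demographic variables. The record layout (id, decile, sex, age) and the function name match_within_deciles are hypothetical.

```python
from collections import defaultdict


def match_within_deciles(recipients, donors, demo_keys):
    """Return a list of (recipient_id, donor_id) pairs.

    A donor is only eligible if it lies in the recipient's viewing decile.
    Among eligible donors, the one agreeing on the most demographic
    variables wins (first one found in case of a tie). Donors may be reused.
    """
    donors_by_decile = defaultdict(list)
    for donor in donors:
        donors_by_decile[donor["decile"]].append(donor)

    pairs = []
    for rec in recipients:
        candidates = donors_by_decile.get(rec["decile"], [])
        if not candidates:
            pairs.append((rec["id"], None))  # no donor in this decile
            continue
        best = max(
            candidates,
            key=lambda d: sum(rec[k] == d[k] for k in demo_keys),
        )
        pairs.append((rec["id"], best["id"]))
    return pairs


# Made-up records for illustration only.
recipients = [{"id": "N1", "decile": 4, "sex": "M", "age": "18-34"}]
donors = [
    {"id": "M1", "decile": 5, "sex": "M", "age": "18-34"},  # wrong decile
    {"id": "M2", "decile": 4, "sex": "F", "age": "18-34"},
    {"id": "M3", "decile": 4, "sex": "M", "age": "35-49"},
    {"id": "M4", "decile": 4, "sex": "M", "age": "18-34"},
]
pairs = match_within_deciles(recipients, donors, ["sex", "age"])
```

In this example, donor M1 is excluded outright because it falls in a different decile, even though its demographics agree perfectly. Among the remaining donors, M4 is chosen because it agrees on both sex and age.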

The quality of the matching is shown in the following chart, which is a scatterplot in which the horizontal axis is the average NTI daily television viewing hours and the vertical axis is the average MARS daily television viewing hours. The scatterplot takes on a block diagonal form, with a total of ten blocks corresponding to the deciles. Within each decile, the matching becomes 'random' in the sense that the statistical matching is driven by the demographic variables without consideration of the television viewing hours. Between deciles, no matching is permitted. When all is said and done, the correlation coefficient between the NTI and MARS daily television viewing hours for this matched sample is 0.957. This is an exceedingly high degree of correlation, especially given the fact that one person's television viewing hours for one day are not necessarily even a perfect predictor of his/her behavior on the next day.
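The following sketch shows why a block diagonal pattern produces a high correlation coefficient. It uses purely synthetic data (1,000 persons with evenly spread viewing minutes), so the resulting value should not be compared with the 0.957 reported above. Within each of ten equal-sized blocks, the pairing is shuffled at random to mimic matching that ignores viewing time. The function names pearson and shuffle_within_blocks are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def shuffle_within_blocks(values, n_blocks, seed=0):
    """Sort values, cut them into n_blocks blocks, shuffle inside each block."""
    rng = random.Random(seed)
    ordered = sorted(values)
    size = len(ordered) // n_blocks
    result = []
    for b in range(n_blocks):
        start = b * size
        end = (b + 1) * size if b < n_blocks - 1 else len(ordered)
        block = ordered[start:end]
        rng.shuffle(block)  # pairing inside the block ignores viewing time
        result.extend(block)
    return result


# Synthetic recipients, and donors paired at random within each decile.
recipient_minutes = list(range(1000))
donor_minutes = shuffle_within_blocks(recipient_minutes, n_blocks=10)
r = pearson(recipient_minutes, donor_minutes)
```

Because no pair ever crosses a decile boundary, the difference within any matched pair is bounded by the width of the decile, and the correlation stays high regardless of how the pairs are arranged inside each block.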

(posted by Roland Soong, 4/21/2002)