Patent-pair citation frequencies
We seek to model the citation frequencies described in Section II above, the way in which these frequencies evolve over time, and how they are affected by characteristics of the citing and cited patents. One way to approach this would be with a probit-type model, in which each patent pair is an observation, and the regression dataset is created by combining the actual citations with a random sample of patent pairs that did not cite each other. One could then ask how the predicted probability that a patent pair will result in a citation is affected by various regressor variables.

In this application, however, we observe approximately five million citations; if these were combined with an equal number of non-citing patent pairs, the regression dataset would have ten million observations. The number of unique combinations of values of the potential regressor variables is, however, a small fraction of that. Put differently, if one were to run a probit on those ten million observations, a great many of them would have identical values for any conceivable set of right-hand-side variables. In such a case, no information is lost by combining observations into “cells” characterized by the values of the regressor variables, and making the dependent variable the fraction of the patent pairs in each cell for which a citation occurred. In this way, we reduce the number of observations from more than five million (the exact value depending on the sampling of non-citing pairs) to a dataset with about 50,000 observations, with little loss of relevant information.
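The collapse into cells can be illustrated with a minimal sketch. It is not the procedure actually used to build the dataset: the function name `collapse_to_cells`, the field names, and the five toy patent pairs are all hypothetical, chosen only to show how pairs with identical regressor values are merged and how the cell-level citation frequency is formed.

```python
from collections import defaultdict

# Hypothetical regressor names; the paper's actual variable coding is not shown here.
KEYS = ("cited_country", "citing_country", "field", "cited_year", "citing_year")


def collapse_to_cells(pairs, keys=KEYS):
    """Group patent pairs that share identical regressor values into cells.

    Returns, for each cell, the number of pairs, the number of citations,
    and the fraction of pairs in the cell for which a citation occurred.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_pairs, n_citations]
    for pair in pairs:
        cell = tuple(pair[k] for k in keys)
        counts[cell][0] += 1
        counts[cell][1] += int(pair["cited"])
    return {
        cell: {"n_pairs": n, "n_citations": c, "frequency": c / n}
        for cell, (n, c) in counts.items()
    }


def make_pair(cited_country, citing_country, cited):
    """Build one toy observation (illustrative values only)."""
    return {
        "cited_country": cited_country,
        "citing_country": citing_country,
        "field": "chemical",
        "cited_year": 1980,
        "citing_year": 1985,
        "cited": cited,
    }


# Five toy patent pairs that fall into two distinct cells.
pairs = [
    make_pair("US", "JP", 1),
    make_pair("US", "JP", 0),
    make_pair("US", "JP", 0),
    make_pair("US", "US", 1),
    make_pair("US", "US", 1),
]

cells = collapse_to_cells(pairs)
for cell, stats in cells.items():
    print(cell, stats)
```

Each cell keeps its pair count alongside the frequency, so a cell-level regression could weight cells by size; whether and how to weight, particularly when the non-citing pairs are a sample, is a modelling choice this sketch does not address.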

Most of our potential regressors, such as cited country, citing country, technology field, cited year and citing year, are categorical rather than continuous variables. In addition to these effects, we wish to capture the evolution of citations over elapsed time, as shown in Figure 1. For this purpose we adapt the formulation of Caballero and Jaffe (1993) and Jaffe and Trajtenberg (1996).