The GCP Data
The GCP network uses high-quality random event generators that produce nearly ideal random numbers. Their output is almost indistinguishable from theoretical expectations even in large samples. Of course, these real-world electronic devices are not perfect theoretical random sources. They inevitably have minute but real residual internal correlations and component interactions, and there are occasional failures. For example, when the power supply is compromised, the internal power regulation may not be able to compensate adequately. The result can be a bad data sequence or, more often, a bad trial or two generated in the transition to complete failure. These are infrequent occurrences, but they matter because the effects we find in analysis are small changes in statistical parameters. It is therefore necessary to identify and remove bad trials and data segments.
History of Online REGs
The following graph shows the 8-year history of online regs in the network. Each blue line represents the period of time a single reg was reporting data. The black trace is the daily sum of online regs. The red trace is the daily sum of online regs minus null trials (see also the section on nulls, below). The graph shows the long running history of some nodes as well as the fair number of nodes with short lives. We also can see the actual network growth and how the network changes even with a more or less constant node number. The relatively flat trend beginning in 2004 reflects a decision to maintain the network at about 65-70 nodes, a number that is manageable with available resources.
Figure: History of the number of Online REGs in the GCP network.
Random Sources
The data are produced by three different makes of electronic random event generators (REG or RNG): Pear, Mindsong and Orion. All data trials are sums of 200 bits. The trialsums are collected once a second at each host site. The Pear and Mindsong devices produce about 12 trials per second and the Orions about 39 (thus 95% of the source data is not collected). REGs are added to the network over time, and the data represent an evolving set of REGs and geographical nodes. The network started with 4 REGs and currently has about 65 in operation. The changing size and distribution of the array contributes to the complexity of some analyses.
Figure: Number of reporting regs in the GCP network by device type.
The data are available for download from the GCP website using a Web-Based Extract Form. Following is a small sample (10 seconds, half the eggs) of the CSV file presented by the form. More detail is available on the data format. Analysts will need further information on file retrieval and processing.
10,1,10,"Samples per record"
10,2,10,"Seconds per record"
10,3,30,"Records per packet"
10,4,200,"Trial size"
11,1,55,"Eggs reporting"
11,2,1102294861,"Start time",2004-12-06 01:01:01
11,3,1102294870,"End time",2004-12-06 01:01:10
11,4,10,"Seconds of data"
12,"gmtime","Date/Time",1,28,37,100,101,102,105,106,108,110,111,112,114,115,116,119,134,161,226,228,231,1004,1005,1021,1022,1025,1026,...
13,1102294861,2004-12-06 01:01:01,111,106,97,93,93,100,116,103,91,88,94,103,85,94,94,99,100,102,103,97,89,114,91,93,100,96,,100,89,103,...
13,1102294862,2004-12-06 01:01:02,95,105,127,106,94,105,100,100,96,99,88,98,101,107,95,103,106,101,105,102,96,95,94,99,101,107,88,100,...
13,1102294863,2004-12-06 01:01:03,84,98,99,109,96,103,96,116,116,102,88,108,97,95,95,92,89,104,105,96,106,105,112,98,115,102,,107,90,...
13,1102294864,2004-12-06 01:01:04,95,105,102,106,83,103,77,99,93,88,101,105,95,109,90,94,107,98,92,108,91,99,102,97,101,109,92,105,100,...
13,1102294865,2004-12-06 01:01:05,104,100,108,107,100,97,101,99,97,92,104,102,110,102,90,105,93,93,86,88,75,109,106,108,99,99,,88,111,...
13,1102294866,2004-12-06 01:01:06,97,106,102,96,101,100,104,101,95,109,94,100,92,97,98,102,114,97,99,109,94,103,81,95,93,104,,99,94,99,...
13,1102294867,2004-12-06 01:01:07,92,103,95,108,100,101,97,103,109,88,113,110,102,97,94,96,110,86,99,99,93,104,86,104,97,100,,97,107,...
13,1102294868,2004-12-06 01:01:08,100,105,92,93,107,94,96,93,108,92,94,84,91,84,102,103,107,113,107,109,98,100,103,99,96,94,,89,94,92,...
13,1102294869,2004-12-06 01:01:09,93,91,104,90,113,90,89,92,104,101,83,93,106,96,98,103,98,100,93,108,109,102,98,121,99,91,,95,105,105...
13,1102294870,2004-12-06 01:01:10,111,96,104,87,98,98,105,99,97,100,102,95,95,113,107,103,96,109,112,123,107,108,109,91,96,105,91,101,...
:
:
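For analysts who want to read the extract programmatically, the following is a minimal Python sketch of a parser for the layout shown in the sample. It is not the GCP's own tooling. It assumes that type 12 records list the egg IDs starting at the fourth field, that type 13 records carry one trialsum per egg in the same order, and that an empty field stands for a null trial. The function name `parse_extract` and the three-egg excerpt are illustrative only.

```python
import csv
import io

# Three-egg excerpt (eggs 1022, 1025, 1026) cut from the sample above.
SAMPLE = '''10,4,200,"Trial size"
12,"gmtime","Date/Time",1022,1025,1026
13,1102294861,2004-12-06 01:01:01,100,96,
13,1102294862,2004-12-06 01:01:02,101,107,88
'''

def parse_extract(text):
    """Return (egg_ids, rows); each row is (unix_time, {egg_id: trialsum or None})."""
    egg_ids, rows = [], []
    for rec in csv.reader(io.StringIO(text)):
        if not rec:
            continue
        if rec[0] == "12":
            # Header record: egg IDs follow the two time column labels.
            egg_ids = [int(field) for field in rec[3:]]
        elif rec[0] == "13":
            # Data record: empty fields are treated as nulls (assumption).
            values = [int(f) if f.strip() else None for f in rec[3:]]
            rows.append((int(rec[1]), dict(zip(egg_ids, values))))
    return egg_ids, rows

egg_ids, rows = parse_extract(SAMPLE)
```

Keeping nulls as `None` rather than dropping them preserves the alignment between eggs and columns, which later per-second calculations across eggs depend on.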
It is fairly common for egg nodes to send null trials. Nulls may persist for long times, as when a host site goes down, or may appear intermittently. Nulls do not cause problems for calculations on the data, but they can add to the inherent variability of some statistics.
Figure: The plot shows the presence of null trials in the data. The regs reporting is the number of reg hosts that send data packets on a given day. An indication of the amount of null trials is given by the difference of reporting regs and the effective non-null reg count which counts a reg as non-null if it has at least one non-null trial for a minute. The count of non-null minutes is averaged per reg for the day. The gray shading shows the cumulative total of regs listed in the network. The number of listed regs is greater than the number of reporting regs because regs are removed when a host site goes offline or retires from the network.
Some statistics summarizing the dimensions and composition of the database are listed in the following table. (Note: the raw data files list times and values of reg output. This doesn't constitute a database in the strict sense of the term. We use database in a looser sense, to refer to all the GCP data through Sept. 8, 2004.)
Normalizing the Data
The Logical XOR
Ideally, trials distribute like binomial[200, 0.5] (mean 100, variance 50). But although they all are high-quality random sources, this is not the case for these real-life devices. A logical XOR of the raw bit-stream with a fixed pattern of bits with exactly 0.5 probability compensates for mean biases of the regs. The Pear and Orion regs use a "01" bitmask and the Mindsong uses a 560-bit mask (the Mindsong mask is the string of all possible 8-bit combinations of 4 "0"'s and 4 "1"'s). Analysis confirms that the XOR'd data has very good, stable means. XORing does more than correct mean bias. For example, XORing a binomial[N,p] will transform it to a binomial[N, 0.5], so the variance is transformed as well. (See further comments on this point.)
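The effect of the XOR on the mean can be illustrated with a small simulation. The Python sketch below is not the device firmware. It uses a pseudo-random generator with a deliberately exaggerated bit bias (p = 0.52) as a stand-in for a hardware source, and compares 200-bit trialsums with and without a "01" mask. It also builds a 560-bit mask from all 8-bit patterns with four 1s; the order in which the real Mindsong device concatenates those patterns is not given here, so the ordering in `balanced_byte_mask` is an assumption.

```python
import random
from itertools import combinations

def balanced_byte_mask():
    """All 8-bit patterns with four 1s, concatenated: 70 patterns * 8 = 560 bits.
    The pattern order is arbitrary (assumption, not the device's order)."""
    bits = []
    for ones in combinations(range(8), 4):
        bits.extend(1 if i in ones else 0 for i in range(8))
    return bits

def xor_trialsum(rng, p_one, mask, n_bits=200):
    """Sum of n_bits biased bits, each XORed with the cyclically repeated mask."""
    total = 0
    for i in range(n_bits):
        raw = 1 if rng.random() < p_one else 0
        total += raw ^ mask[i % len(mask)]
    return total

rng = random.Random(7)   # pseudo-random stand-in for a hardware REG
p = 0.52                 # exaggerated bias, for illustration only
n_trials = 2000

# A mask of [0] leaves the bits unchanged, so this is the raw, biased mean.
raw_mean = sum(xor_trialsum(rng, p, [0]) for _ in range(n_trials)) / n_trials
# The "01" mask flips every second bit, which cancels the mean bias.
xor_mean = sum(xor_trialsum(rng, p, [0, 1]) for _ in range(n_trials)) / n_trials

mindsong_style_mask = balanced_byte_mask()
```

With these settings the unmasked mean sits near 104, while the masked mean returns to about 100, because half of the bits have probability p of being 1 and the other half have probability 1 - p.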
Using the XOR to mitigate possible biases has implications. For example, Jeff Scargle says, "I feel this process rejects the kind of signal that most people probably think is being searched for ... namely consciousness affecting the RNGs ... but because of the XOR, you are only sensitive to consciousness affecting the final data stream. We discussed this before, but I still find it amazing that the entire operation has thrown out at least this baby along with the bathwater."
The two main points I make in response are:
1) the XOR is used to exclude an important class of potential spurious effects -- biases that might arise from temperature changes, component aging, etc.
2) the XORed data streams do exhibit anomalous structure in controlled laboratory experiments, as well as in the timeseries we record for the GCP. They do not show extra-chance anomalous structure in calibrations and control sequences.
See also the description of the REG devices used in the GCP, which includes discussion of the XOR procedure's purposes and implementation. A note responding to skeptical concerns about the XOR contains more technical detail.
After XOR'ing, the mean is guaranteed over the long run to fit theoretical expectation. The trial variances remain biased, however. The biases are small (about 1 part in 10,000) and generally stable on long timescales. They are corrected by standardization of the trialsums to standard normal variables (z-scores). Mindsong regs tend to have positive biases. This gives a net positive variance bias to the data. Since the GCP hypothesis explicitly looks for a positive variance bias, there is a requirement for important, albeit small, corrections. The variance biases tell us that the raw, unXOR'd trials cannot be modeled as a simple binomial with shifted bit probability.
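As an illustration of the standardization step, here is a minimal Python sketch. The exact GCP procedure is described on the data preparation page and is not reproduced here; this version assumes that each device's variance is estimated empirically about the theoretical mean of 100, and that trialsums are then divided by the corresponding standard deviation. The function names and the toy trial values are hypothetical.

```python
import math

THEORETICAL_MEAN = 100.0   # mean of binomial[200, 0.5]
THEORETICAL_VAR = 50.0     # variance of binomial[200, 0.5]

def empirical_variance(trials):
    """Variance about the theoretical mean, ignoring nulls (None)."""
    valid = [t for t in trials if t is not None]
    return sum((t - THEORETICAL_MEAN) ** 2 for t in valid) / len(valid)

def standardize(trials, device_var=THEORETICAL_VAR):
    """Convert trialsums to z-scores; nulls pass through as None."""
    sd = math.sqrt(device_var)
    return [None if t is None else (t - THEORETICAL_MEAN) / sd for t in trials]

# Toy data for one device; a real estimate would use months or years of trials.
trials = [111, 95, None, 84, 104, 97, 92, 100, 93, 111]
var_hat = empirical_variance(trials)
z = standardize(trials, var_hat)
```

Passing the empirical variance instead of the theoretical value of 50 is what removes a device's small variance bias: by construction, the resulting z-scores have unit variance over the period used for the estimate.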
Figure: Deviations from theoretical expectation for Mindsong and Orion regs. Expectation for the variance of z-scores is 1. The axis scale gives some indication of the size of the biases, typically less than 0.25%.
Figure: Empirical mean values per quarter of the device variance for Mindsong REGs (blue) and the Orion and PEAR devices (red). Expectation for the variance of z-scores is 1. The axis scale gives some indication of the size of the biases, averaging less than 0.1% for Mindsong and nearly zero for the Orions after 2000, when the N had increased to allow reasonable sample sizes.
The biases are small, even for the few devices that look like outliers in the figure above, but they are stable, and given large samples of months and years of data, they become statistically significant. We treat them as real biases that need to be corrected by normalization for rigorous analysis. The sensitivity of analyses to variance bias depends on the statistic calculated. Two typical calculations are the trial variance and the variance of the trial means across regs at 1-second intervals. That is, Var(z) and Var(Z) = Var(Sum(z)), where z are the reg trials and the sum is over all regs for each second. We sometimes refer to these as the device and network variances, respectively. (Note: when using standardized trial z-scores, there is little difference between variances calculated with respect to theoretical or sample means; theoretical and sample variances will be distinguished where necessary.)
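The two statistics can be sketched in a few lines of Python. This is a minimal illustration, not the project's analysis code. It assumes the data are arranged as one row per second with one standardized z-score per reg, that nulls are stored as `None` and skipped, and that variances are taken about the theoretical mean of 0. The simulated Gaussian input merely stands in for real z-scores.

```python
import random

def device_variance(z_rows):
    """Var(z): mean of z^2 over all non-null trials, about the theoretical mean 0."""
    zs = [z for row in z_rows for z in row if z is not None]
    return sum(z * z for z in zs) / len(zs)

def network_variance(z_rows):
    """Var(Z), where Z is the sum of z over all non-null regs in each second."""
    sums = [sum(z for z in row if z is not None) for row in z_rows]
    return sum(s * s for s in sums) / len(sums)

# Simulated stand-in: 3000 seconds of independent z-scores from 10 regs.
rng = random.Random(1)
n_regs, n_seconds = 10, 3000
z_rows = [[rng.gauss(0.0, 1.0) for _ in range(n_regs)] for _ in range(n_seconds)]

dev_var = device_variance(z_rows)
net_var = network_variance(z_rows)
```

For independent, unbiased regs the device variance is expected to be 1 and the network variance of the unnormalized sum is expected to equal the number of reporting regs (10 here), so a small per-device variance bias carries over directly into both statistics.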
Identifying and Treating Bad Data
Bad Network Days
On a few days, the network produced faulty or incomplete data. These occurred during the first weeks after the GCP began operation and during a hacker attack in August 2001. The days August 11, 25, 31 and September 6 have fewer than 86400 seconds of data. These days are retained in the database. For the days August 5-8, inclusive, the data consist mostly of nulls for all regs. These days have been removed from the standardized data.
The regs occasionally produce improbable trial values. This is usually associated with intermittent hardware problems such as a sudden loss of power during sampling or buffer reads. These trials are removed before analysis. All trialsums that deviate by 45 or more from the theoretical mean of 100 are removed and replaced by nulls. See the data preparation page for more details.
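The out-of-bounds rule stated above is simple enough to express directly. The following Python sketch assumes trialsums are held in a list with `None` for nulls; the function name is hypothetical.

```python
def remove_out_of_bounds(trials, mean=100, limit=45):
    """Replace trialsums deviating by `limit` or more from `mean` with None (null).
    Existing nulls are left as they are."""
    return [
        None if t is None or abs(t - mean) >= limit else t
        for t in trials
    ]

cleaned = remove_out_of_bounds([100, 55, 56, 145, 144, None, 200])
```

With the stated threshold, values of 55 or less and 145 or more are nulled, while 56 through 144 are kept.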
Rotten Eggs
After out-of-bound trials have been removed, the mean and variance of each reg are checked for stability. The following links show graphs of actual data for individual eggs for long periods (up to four years).
Sections of reg data that do not pass stability criteria are masked and excluded from analysis. Data from these "rotten eggs" are usually very obvious, as the next figure shows. There are cases where excluding data is a judgment call. The current criteria impose a limit that will, on average, exclude 0.02% of valid data (an hour or two of data per year). See the data preparation page. For more detail, contact the project director, Roger Nelson.