Sunday, June 9, 2019

On Combining Record Series of Different Lengths Into A Time Series

How to combine a collection of different time series into one grand time series is not a question the average person is ever likely to ask, much less care about. However, in the grand debate over climate change this question remains important. Many people, maybe even most, will naturally say this question was asked and answered long ago. Okay, most people don't know and really don't care. Neither of those things means that basic questions shouldn't be asked. What if reinventing the wheel produces a better wheel? And who wouldn't want a better wheel?

Maybe established wheel makers?

I have asked this basic question, and I have an answer which I would like to present to anyone who is interested. I will do so in the simplest manner possible. The following examples describe the process. I have chosen a common waveform, the sine function, as the basis for the sample data.

Consider the four time series depicted below. All four follow the sine wave pattern around different averages. However, only one record is complete. The other three are fragments measured from longer series. We assume these fragments follow patterns which are the same as, or very similar to, that of the complete series. I know this is true because I constructed the data so this would be true.


Obviously we cannot simply compute an average as we could if we had complete data for all four series. The error would come from the differences in average between the series. I am going to transform the data into a form where those differences are minimized. I will do so by subtracting the series average from each point of a series. This, in effect, translates the series average to zero.
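A minimal Python sketch of this transform, under stated assumptions: the function names `center_on_own_mean` and `composite`, the period of 20, and the offsets 10 and 25 are all invented for illustration and are not the data behind the figures. Missing points are marked with `None`. The fragment here is deliberately chosen to span whole cycles, so its average is estimated without error.

```python
import math

def center_on_own_mean(series):
    """Subtract the mean of the available values; None marks a missing point."""
    present = [v for v in series if v is not None]
    mean = sum(present) / len(present)
    return [None if v is None else v - mean for v in series]

def composite(series_list):
    """Average whatever values are available at each time step."""
    out = []
    for values in zip(*series_list):
        present = [v for v in values if v is not None]
        out.append(sum(present) / len(present) if present else None)
    return out

# Hypothetical sample data: one sine pattern riding on two different averages.
n, period = 100, 20
wave = [math.sin(2 * math.pi * t / period) for t in range(n)]
full = [10 + w for w in wave]                       # complete record
frag = [25 + w if 40 <= t < 80 else None            # fragment, two whole cycles
        for t, w in enumerate(wave)]

centered = [center_on_own_mean(s) for s in (full, frag)]
estimate = composite(centered)
```

With these inputs both centered series collapse onto the same zero-mean wave, so `estimate` reproduces the underlying pattern even though the raw records sit 15 units apart.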


The final result of this operation is shown below.


You will notice there are errors in this process with the shortest two of the four series. The reason in each case is an error in estimating the true series average, caused by the short length of the record. Each of these series is a cyclical curve with a repeating pattern. Any estimate of the average based upon a length of time which is not equal to, or a whole multiple of, the length of this pattern will be biased by an amount that depends upon where the endpoints lie relative to the pattern.
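The endpoint effect can be illustrated with a small Python sketch. The helper `window_mean`, the period of 20, and the true average of 5.0 are assumptions made up for this example. It compares the mean over two whole cycles with the means over the first and second half of a single cycle.

```python
import math

def window_mean(start, length, period=20, offset=5.0):
    """Mean of offset + sin(2*pi*t/period) over `length` samples from `start`."""
    values = [offset + math.sin(2 * math.pi * t / period)
              for t in range(start, start + length)]
    return sum(values) / len(values)

true_mean = 5.0
whole = window_mean(start=3, length=40)     # two whole cycles, any start point
upper = window_mean(start=0, length=10)     # first half-cycle only
lower = window_mean(start=10, length=10)    # second half-cycle only

# Signed error of each estimate relative to the true average.
errors = {name: value - true_mean
          for name, value in (("whole", whole), ("upper", upper), ("lower", lower))}
```

The whole-cycle window recovers the true average, while the two half-cycle windows miss it by the same amount in opposite directions, which is the dependence on endpoint position described above.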

The following shows the resulting estimate of the complete average using the transformed data above. This estimate is very close to the actual average.



The equation of the linear trend for the estimated average is y = -0.0004x + 0.1049. The actual average is also end point biased. Its linear trend line is y = -0.0005x + 0.1353. The true trend is zero.
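To see how a fitted line can show a small slope when the true trend is zero, here is a sketch using an ordinary least-squares fit written out by hand. The function `linear_trend` and the record lengths are assumptions for illustration. The numbers it produces are not the ones quoted above, which come from different data.

```python
import math

def linear_trend(y):
    """Least-squares slope and intercept of y against x = 0, 1, 2, ..."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

period = 20
short = [math.sin(2 * math.pi * t / period) for t in range(2 * period)]
long_ = [math.sin(2 * math.pi * t / period) for t in range(20 * period)]

slope_short, intercept_short = linear_trend(short)
slope_long, intercept_long = linear_trend(long_)
```

A pure sine wave that starts at zero phase sits above its average early in the record and below it late, so the fitted slope comes out slightly negative even over whole cycles. The slope shrinks toward zero as the record gets longer.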

The following example illustrates how this procedure performs using two series which are identical with the exception of a trend induced into a portion of one series.




Now I am going to present the same operation using the anomalies from a common baseline technique which, as I understand it, is another commonly used method. For this example my baseline period corresponds to the portion of time where a trend was induced into one of the two series. I have chosen to do this to illustrate the potential for error.
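The two approaches can be put side by side in a short Python sketch. Everything here is a hypothetical stand-in for the example data: the function `anomalies`, the baseline window of steps 60 to 79, and the ramp of 0.05 per step added to one series inside that window.

```python
import math

def anomalies(series, base_start, base_end):
    """Subtract the series mean over [base_start, base_end) from every value."""
    base = series[base_start:base_end]
    base_mean = sum(base) / len(base)
    return [v - base_mean for v in series]

n, period = 100, 20
wave = [math.sin(2 * math.pi * t / period) for t in range(n)]
a = [10 + w for w in wave]
# Same series, plus a temporary ramp confined to the baseline window.
b = [10 + w + (0.05 * (t - 60) if 60 <= t < 80 else 0.0)
     for t, w in enumerate(wave)]

# Anomalies from the common baseline (steps 60..79).
a_base, b_base = anomalies(a, 60, 80), anomalies(b, 60, 80)
# Anomalies from each series' own full-length average.
a_own, b_own = anomalies(a, 0, n), anomalies(b, 0, n)

# Mismatch between the two series at t = 0, where the raw data are identical.
gap_baseline = abs(a_base[0] - b_base[0])
gap_own = abs(a_own[0] - b_own[0])
```

In this setup the two raw series are identical outside the window, yet the baseline anomalies leave them 0.475 apart there, against 0.095 when each series is centered on its own full average. The temporary ramp shifts the short baseline mean far more than it shifts the full-record mean.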


As we see, the fit between the two series is not as good as with the previous process. The resulting average has exactly the same trend as with the previous method, with a slightly lower overall average.


I am now going to further explore how these two processes differ by looking at the range of the data produced over the length of the time series. Keep these results in mind.




Both processes are a means of meshing together data series of differing lengths and differing base averages. Both processes share the same error source: the error in estimating how the data you have relates to the actual local average over the full time frame being studied. This is in fact an unknown.

The assumption, as I stated in my first example, is that a pattern exists which would adequately describe all the fragmentary records we have, if we only knew what that pattern was. Assuming such a pattern exists in reality, and having produced an estimate of it, it is necessary to determine how good the estimate is. One way of doing so is to test your estimate against the data you have. The following shows how well the estimated pattern from each method, the anomaly method and my transform method, matches the patterns of the individual data sets used in my example.
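One possible way to put a number on "how well the estimate matches" is a root-mean-square difference, taken only over the time steps where a given series has data. This metric is an assumption chosen for illustration, since the comparison above is made visually. The function `rms_mismatch` and the toy values are likewise hypothetical.

```python
import math

def rms_mismatch(estimate, series):
    """Root-mean-square difference over the steps where `series` has data."""
    diffs = [e - v for e, v in zip(estimate, series) if v is not None]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Toy values: an estimated pattern and two transformed fragments (None = no data).
estimate = [0.0, 1.0, 0.0, -1.0]
close_fit = [None, 1.0, 0.0, None]
poor_fit = [0.5, 1.5, None, -0.5]

print(rms_mismatch(estimate, close_fit))  # 0.0
print(rms_mismatch(estimate, poor_fit))   # 0.5
```

A score near zero means the estimated pattern passes through that fragment. A larger score means the fragment was shifted away from the pattern by whichever averaging step was used to align it.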


My transform method produces a good match where the two series follow the same pattern. The anomaly method produces a good match only in the area where the individual series data are close to the baseline average determined during the time of a temporary localized trend.

The bottom line here is that the sub-average used to calculate anomalies in my example was simply an inaccurate estimate of the average of the available data. The best estimate of the average of the available data is the average of the available data. Which is so basic, isn't it?

This is the essential weakness of any attempt to combine incongruent data series of varying lengths. You do not know if the snippet of data you have represents a point in time which is higher or lower than the true average. The correct way to average the data is exactly the same way you would do so if you had complete data.

Therefore, the most accurate method to use is one which comes the closest to the ideal condition of replicating the true average for the time period you are looking at.

Now, let's discuss how an uneven number of these short time series might induce a bias into a calculated composite time series using the anomalies from a baseline method. I will again demonstrate using an example.

For this example I am using three simulated series, two of which have induced negative trends at the time I am using to establish a baseline from which to compute anomalies.


Now I am going to convert each curve to a set of anomalies from this baseline period.


As you can see the fit is not exactly optimal. Below is how this new computed average looks against the actual average.



As expected the match of the estimated curve is only good close to where the three curves were averaged. Below is how this biased average manifests in terms of the error from true.



As you can see, the net effect is to lower the estimate of the past. What causes this bias in the resulting average is the number of series which are complete from the baseline point forward and incomplete from the baseline point back. Obviously, if we had complete data sets for all three series, the resulting average would have the correct shape. Also, if the lengths and times of the series were evenly distributed throughout, the composite time series would be more accurate, as the error would be more or less evenly distributed. However, in this scenario the error is forced into the past through the averaging process and by the increasing number of series added moving forward.

This is the mechanism whereby this process will be biased to either overestimate or underestimate the past, depending upon the number of series added moving forward and the direction of any anomalous trends in the baseline period.















