Low-level utilities.
# return (data[:idx], data[idx:]) - everything up to idx and everything after
def split_array(data, idx): return data[:idx], data[idx:]
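The tests below use a rounding helper `r3` that is defined elsewhere. Judging by the assertions, a minimal sketch might look like this (the scalar branch is an assumption based on how `r3` is called later):

```python
# Hypothetical sketch of r3 (the real definition lives elsewhere): round
# every element to 3 decimal places and return a new container of the same
# type; scalars are rounded directly.
def r3(data):
    if isinstance(data, (list, tuple)):
        return type(data)(round(v, 3) for v in data)
    return round(data, 3)
```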
test_in = [3.3, 1.1, 2.02, 3.003, 4.0004]
test_result = r3(test_in)
assert test_in is not test_result
assert [3.3, 1.1, 2.02, 3.003, 4.0] == test_result
test_in = tuple(test_in)
test_result = r3(test_in)
assert test_in is not test_result
assert (3.3, 1.1, 2.02, 3.003, 4.0) == test_result
For efficiency, we need to be able to calculate standard deviations without processing all of the data with `np.std`.
By keeping track of:
- count of items `c`
- sum of items `s`
- sum of items squared `s2`

we can calculate variance with `(s2/c) - (s/c)**2`.

Note: we have to clamp the variance at zero. When working with small numbers, it sometimes ends up negative due to numerical instability.
It might be interesting to do a moving average implementation like: https://github.com/pete88b/data-science/blob/master/pytorch-things/calculating-variance.ipynb
# weight the standard deviation by the number of items it covers
def np_score(y): return np.std(y)*len(y)
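`Aggs` itself is defined elsewhere. A hypothetical sketch consistent with the tests below - all items start on the right-hand side, and `upd` moves one item to the left - might be (all attribute and helper names here are assumptions):

```python
import math

class Aggs:
    # Hypothetical sketch: _c/_s/_s2 aggregate all items, c/s/s2 aggregate
    # the items moved to the left side so far.
    def __init__(self, y):
        self._c = len(y)
        self._s, self._s2 = sum(y), sum(v * v for v in y)
        self.c, self.s, self.s2 = 0, 0.0, 0.0
    def upd(self, v):
        self.c += 1
        self.s += v
        self.s2 += v * v
    def _side_score(self, c, s, s2):
        if c == 0: return 0.0
        var = max(0.0, s2 / c - (s / c) ** 2)  # clamp at zero
        return math.sqrt(var) * c              # std weighted by count
    def score(self):
        # left-side score plus right-side score (right = totals minus left)
        return (self._side_score(self.c, self.s, self.s2)
                + self._side_score(self._c - self.c,
                                   self._s - self.s,
                                   self._s2 - self.s2))
```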
y = np.linspace(-2.5, 2.5, 11)
aggs = Aggs(y)
assert aggs._c == 11
for i in range(len(y)-1):
    aggs.upd(y[i])
    y_le, y_gt = split_array(y, i+1)
    assert len(y_le) == aggs.c
    assert r3(aggs.score()) == r3(np_score(y_le) + np_score(y_gt))
assert 14.361 == r3(aggs.score())
# y = x + np.random.random(x.shape)
x = np.array([-1., -0.778, -0.556, -0.332, -0.115, 0.119, 0.331, 0.556, 0.777, 1.])
y = np.array([-0.721, 0.036, 0.366, 0.490, 0.565, 1.080, 0.414, 1.125, 1.483, 1.569])
assert 0.480 == np.round(mse(x, y), 3)
assert 0.693 == np.round(rmse(x, y), 3) # expect 0.6931791254791217
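`mse` and `rmse` are defined elsewhere in the library; sketches consistent with the checks above might be:

```python
import numpy as np

# Hypothetical sketches (the actual definitions live elsewhere):
# mean squared error, and its square root.
def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))
```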