Latchup Model Approach
-
subsections for eval section
- accuracy
- vs baselines (why they don’t work)
- overhead (overhead negligible during workload, some during idle)
- CPU power state
- won’t affect real workloads
-
Snapdragon core, how to
-
look into passive observer prototype
- minimal interface:
- 1KB logs - exact file locations
- 1 byte to encode: reboot counter?
- 1-command install and run
-
Smaller shunt resistors do not affect correctness (verified with another 1 hr test)
- Because we use a running minimum, random current spikes are accounted for
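The running-minimum idea can be sketched as a sliding-window minimum (pure-Python sketch; the window length and the example trace below are made up for illustration, not the actual configuration):

```python
from collections import deque

def running_min(samples, window):
    """Sliding-window minimum: a sustained current rise eventually shifts
    the minimum, while a brief spike shorter than the window is ignored."""
    out = []
    q = deque()  # holds (index, value) pairs with strictly increasing values
    for i, v in enumerate(samples):
        while q and q[-1][1] >= v:
            q.pop()
        q.append((i, v))
        if q[0][0] <= i - window:
            q.popleft()
        out.append(q[0][1])
    return out

# A 2-sample spike inside a 5-sample window never raises the minimum:
trace = [100, 100, 400, 400, 100, 100, 100]
print(running_min(trace, 5))  # -> [100, 100, 100, 100, 100, 100, 100]
```

Only a rise that persists for the whole window shows up in the filtered signal, which is why random spikes don't trigger false latchup detections.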
-
I2C bandwidth usage should be well within budget
- Raspberry Pi I2C supports 400 kb/s (Fast mode)
- INA3221 matches this: 400 kb/s in Fast mode (its High-speed mode goes even higher)
- At 1 measurement per millisecond, with about 7-8 bytes per channel, at most 64 kb/s is needed
- Only 64/400 = 16% of total bandwidth
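A quick sanity check of the bandwidth arithmetic (figures taken from the bullets above; 8 bytes is the worst case of the 7-8 byte estimate):

```python
# Back-of-the-envelope I2C bandwidth check, using the assumed figures from
# these notes: 400 kb/s bus, 1 sample per millisecond, ~8 bytes per read
# transaction (register pointer write + 2-byte register read + addressing).
BUS_KBPS = 400
SAMPLES_PER_SEC = 1000
BYTES_PER_READ = 8

bits_per_sec = SAMPLES_PER_SEC * BYTES_PER_READ * 8
kbps = bits_per_sec / 1000
print(f"{kbps:.0f} kb/s used, {100 * kbps / BUS_KBPS:.0f}% of the bus")
# -> 64 kb/s used, 16% of the bus
```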
-
24 1 hr runs, because storage space on the Pi is limited
- Collected data over 960 hrs ≈ 40 days of testing
- 0.0015 false positive rate
- 0 false negative rate
- MTTF of 22 hours
-
Reduced false positives to about one every 22 hours (MTTF) -> 0.002 FP rate
-
Two models for detection? One for idle and one for workloads
- [x] Heuristic: running average over 3 minutes of data, 1% FP rate
- Model trained only on idle works much better for idle: 0.097 mean absolute error
- Model trained on both workload and idle doesn't do too well on just idle data
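The running-average heuristic could look roughly like this (the window size and threshold ratio below are placeholder values for illustration, not the actual tuning):

```python
from collections import deque

def detect_latchup(samples, window, threshold_ratio=1.5):
    """Hypothetical heuristic detector: flag a sample if it exceeds the
    running average of the previous `window` samples by `threshold_ratio`.
    Both parameters are illustrative, not the values used in the notes."""
    history = deque(maxlen=window)
    flags = []
    for v in samples:
        avg = sum(history) / len(history) if history else v
        flags.append(v > threshold_ratio * avg)
        history.append(v)
    return flags

# A sustained jump from 100 mA to 300 mA gets flagged; the steady
# baseline before it does not.
flags = detect_latchup([100] * 10 + [300] * 3, window=5)
```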
-
Forced quiescence - similar approach to forced context switches
- Detect natural idles and reset timer then
- Very useful for Cryptosat workloads
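A minimal sketch of that timer logic (class name and API are made up; the point is just that a natural idle pushes the forced-cooldown deadline back, so well-behaved workloads never see a forced pause):

```python
import time

class QuiescenceTimer:
    """Sketch of forced quiescence, analogous to forced context switches:
    force a cooldown every `period_s` seconds, but reset the deadline
    whenever a natural idle is observed."""
    def __init__(self, period_s, clock=time.monotonic):
        self.period = period_s
        self.clock = clock
        self.deadline = clock() + period_s

    def on_natural_idle(self):
        # A natural idle provides the same protection as a forced one,
        # so it restarts the countdown.
        self.deadline = self.clock() + self.period

    def should_force_cooldown(self):
        if self.clock() >= self.deadline:
            self.deadline = self.clock() + self.period
            return True
        return False
```

An injectable `clock` makes the policy testable without waiting out real minutes.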
-
Measure runtime/current draw/CPU before/after using system
- Consistently near-zero performance impact so far on SPICE benchmarks
-
see if model can minimize the cooldowns being done
- on idle: 10% FP rate per 3-minute window
- on typical workload: 5% FP rate per 3-minute window
-
test baseline of forced cooldown every 2 mins
-
look at current draw of writes with O_DIRECT
-
test adding a latchup and seeing how model behaves
-
inject the baseline and see if we can do that
-
look at thrashing detection/other OS metrics
- look at flamegraph of syscalls during thrashing
- iostat: check whether the performance impact is significant
-
wider detection window
-
graph of estimate vs real current
-
try building linear model
- take moving average of input values
- training data - sample more frequently
- kalman filter?
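A pure-Python sketch of the moving-average-then-linear-model idea (single feature for simplicity; the real inputs, e.g. CPU utilization counters, are whatever the logger records):

```python
def moving_average(xs, k):
    """Trailing moving average over the last k samples, to smooth
    noisy counter inputs before fitting."""
    out = []
    for i in range(len(xs)):
        lo = max(0, i - k + 1)
        out.append(sum(xs[lo:i + 1]) / (i + 1 - lo))
    return out

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Smooth the feature, then fit it against measured current.
feature = moving_average([1, 2, 3, 4], 2)
a, b = fit_line(feature, [3, 5, 7, 9])
```

A Kalman filter (mentioned above as an option) would replace the fixed moving average with an adaptive state estimate, at the cost of tuning process/measurement noise.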
-
time series validation - estimated/real current
- validate on a second run of the Pi
-
normalize features between 0/1 and then put into model
- outliers would otherwise skew the predictor on future data
-
look at how feature ratios affect the model
- normalize against CPU
- normalize to min/max
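A sketch of the min/max normalization, with clipping as one possible way to handle the outlier concern above (the notes only say to normalize; clipping to the training range is an assumption):

```python
def fit_minmax(train):
    """Record per-feature min/max on the training data."""
    lo, hi = min(train), max(train)
    return lo, hi

def apply_minmax(x, lo, hi):
    """Scale to [0, 1]; clip so out-of-range values seen later
    (future outliers) stay bounded instead of blowing up the model."""
    if hi == lo:
        return 0.0
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))
```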
-
see if we can build a predictor just on CPU frequency
- see if predictor can be applied against memory
- linear regression on CPU/memory utilization
- check numactl single core to see if frequency matters
-
page faults
-
PCA on results
-
Pin memory bandwidth test to one core, and then look at how it performs
-
add HW cache events
- instrument read/write?
-
performance impact of logging program
- tested against timed FFTW benchmarks, zero impact due to low sampling rate
-
false positives?
-
look into other hardware counters, scheduler?
-
Tom Anderson's student on Treehouse: trying to predict energy consumption from performance counter events
-
look into prior energy-profiling work for the Raspberry Pi - e.g., predicting battery consumption of phone apps, roughly 10-15 years ago
- write up how the integral doesn't directly translate (prior work targets battery consumption)
- scheduler optimizing for battery
-
power capping in datacenters - Microsoft tries to cap the power of a datacenter rack (bin-capping)