
Latchup Model Approach

Harry edited this page Sep 13, 2024 · 1 revision
  • subsections for eval section

    • accuracy
    • vs baselines (why they don’t work)
    • overhead (negligible during workload, some during idle)
      • CPU power state
      • won’t affect real workloads
  • snapdragon core, how to

  • look into passive observer prototype

    • minimal interface:
      • 1KB logs - exact file locations
      • 1 byte to encode: reboot counter?
    • 1-command install and run
  • Smaller shunt resistors do not affect correctness (another 1hr test)

    • Because we use a running minimum, random current spikes are accounted for
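
The spike-robustness argument above can be sketched as a sliding-window running minimum: a transient current spike raises only the instantaneous reading, never the window minimum. A minimal sketch (the window size is illustrative, not from the notes):

```python
from collections import deque

def running_min(samples, window=50):
    """Running minimum over a sliding window. A short random current
    spike only affects the newest reading, so the window minimum (and
    hence the detector) ignores it."""
    out = []
    q = deque(maxlen=window)
    for s in samples:
        q.append(s)
        out.append(min(q))
    return out
```
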
  • I2C bandwidth usage should be very doable

    • Raspberry Pi I2C supports fast mode at 400 Kb/s
    • INA3221 supports fast mode (400 Kb/s) and high-speed mode up to 2.44 Mb/s
    • At 1 measurement per millisecond, with about 7-8 bytes per channel, at most 64 Kb/s is needed
    • That is 16% of a 400 Kb/s bus
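
The budget works out as a two-line check (assuming one 8-byte sample per millisecond on a 400 Kb/s bus, per the figures above):

```python
# I2C bandwidth budget (assumptions: 8 bytes/sample, 1 sample per ms, 400 Kb/s bus)
bytes_per_sample = 8
samples_per_sec = 1000
bits_needed = bytes_per_sample * samples_per_sec * 8  # 64,000 bit/s = 64 Kb/s
bus_bits = 400_000
share = bits_needed / bus_bits                        # fraction of the bus used
```
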
  • 24 × 1 hr runs, because storage space on the Pi is limited

    • Collected 960 hrs of data in total, roughly 40 days of tests
    • 0.0015 false positive rate
    • 0 false negative rate
    • MTTF (mean time to false positive) of 22 hours
  • One false positive every 22 hours -> ~0.002 FP rate per 3-minute window
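
Converting "one false positive every 22 hours" into a per-window rate is just counting windows, assuming the 3-minute detection windows used elsewhere in these notes:

```python
# One false positive every 22 hours, with 3-minute detection windows
windows_per_hour = 60 // 3             # 20 windows per hour
fp_rate = 1 / (22 * windows_per_hour)  # per-window false-positive rate, ~0.0023
```
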

  • Two models for detection? One for idle and one for workloads
    • [x] Heuristic: running average over 3 minutes of data, 1% FP rate

    • Model trained only on idle works much better for idle - 0.097 mean absolute error
    • Model trained on both workload and idle doesn’t do too well on just idle data
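
A minimal sketch of the running-average heuristic above: flag a sample when it exceeds the trailing average by more than a threshold. The window and threshold values here are illustrative, not from the notes:

```python
from collections import deque

def running_avg_detector(samples, window, threshold):
    """Flag a sample when it exceeds the running average of the
    previous `window` samples by more than `threshold`."""
    q = deque(maxlen=window)
    flags = []
    for s in samples:
        avg = sum(q) / len(q) if q else s  # no history yet: compare to itself
        flags.append(s - avg > threshold)
        q.append(s)
    return flags
```
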
  • Forced quiescence - similar approach to forced context switches

    • Detect natural idle periods and reset the timer then
    • Very useful for Cryptosat workloads
  • Measure runtime/current draw/CPU before/after using system

    • Consistently negligible performance impact on SPICE benchmarks so far
  • see if the model can minimize the number of forced cooldowns

    • on idle: 10% FP rate per 3-minute window
    • on typical workload: 5% FP rate per 3-minute window
  • test baseline of forced cooldown every 2 mins

  • look at current draw of writes with O_DIRECT

  • test adding a latchup and seeing how model behaves

  • inject the baseline and see if we can do that

  • look at thrashing detection/other OS metrics

    • look at flamegraph of syscalls during thrashing
    • iostat: check whether the performance impact is significant
  • wider detection window

  • graph of estimate vs real current

  • try building linear model

    • take moving average of input values
      • training data - sample more frequently
      • kalman filter?
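
A trailing moving average over the sampled counters, as suggested above, could look like this (window size illustrative):

```python
def moving_average(xs, k):
    """Smooth noisy counter samples with a trailing window of size k
    before feeding them to the linear model."""
    out = []
    for i in range(len(xs)):
        w = xs[max(0, i - k + 1): i + 1]  # shorter window at the start
        out.append(sum(w) / len(w))
    return out
```

A Kalman filter would be the more principled alternative if the noise model is known; this is only the simple baseline.
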
  • time series validation - estimated/real current

    • validate on a second run of the Pi
  • normalize features between 0/1 and then put into model

    • outliers would screw up predictor in the future
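
One way to normalize features into [0, 1] while keeping future outliers from rescaling everything is to fix the range from the training data and clip out-of-range values. A sketch; the parameter handling is an assumption:

```python
def minmax_normalize(xs, lo=None, hi=None):
    """Scale features into [0, 1]. Fixing lo/hi from the training range
    (rather than per batch) stops a future outlier from distorting the
    scale; values outside the range are clipped."""
    lo = min(xs) if lo is None else lo
    hi = max(xs) if hi is None else hi
    span = hi - lo or 1.0  # avoid division by zero for constant features
    return [min(1.0, max(0.0, (x - lo) / span)) for x in xs]
```
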
  • look at how ratios affect

    • normalize against CPU
    • normalize to min/max
  • see if we can build a predictor just on CPU frequency

    • see if predictor can be applied against memory
    • linear regression on CPU/memory utilization
    • check numactl single core to see if frequency matters
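
A CPU-frequency-only predictor is one-variable least squares; the sample points below are made up purely for illustration:

```python
def fit_line(x, y):
    """Ordinary least squares with a single feature, e.g. predicting
    current draw from CPU frequency alone (names are illustrative)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx  # slope, intercept

# Hypothetical (frequency MHz, current A) pairs
slope, intercept = fit_line([600, 1200, 1800], [0.3, 0.5, 0.7])
```

The CPU + memory utilization version would be ordinary multiple regression (e.g. `numpy.linalg.lstsq`) over both features.
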
  • page faults

  • PCA on results

  • Pin memory bandwidth test to one core, and then look at how it performs

  • add HW cache events

    • instrument read/write?
  • performance impact of logging program

    • tested against timed FFTW benchmarks, zero impact due to low sampling rate
  • false positives?

  • look into other hardware counters, scheduler?

  • Tom Anderson's student on Treehouse: trying to predict energy consumption from performance counter events

  • look into prior energy-profiling work for the Raspberry Pi - predicting battery consumption of phone apps - maybe 10-15 years ago

    • write up how the integral doesn't directly translate (previous work targets battery consumption, not latchup detection)
    • scheduler optimizing for battery
  • power capping in datacenters - Microsoft tries to cap the power of a datacenter rack (bin-capping)
