
Latchup Model Approach

Harry edited this page Sep 13, 2024 · 1 revision
  • subsections for eval section

    • accuracy
    • vs baselines (why they don’t work)
    • overhead (negligible during workload, some during idle)
      • CPU power state
      • won’t affect real workloads
  • snapdragon core, how to

  • look into passive observer prototype

    • minimal interface:
      • 1KB logs - exact file locations
      • 1 byte to encode: reboot counter?
    • 1-command install and run
  • Smaller shunt resistors do not affect correctness (another 1hr test)

    • Because we use a running minimum, random current spikes are accounted for
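
The spike-robustness argument above can be sketched as a sliding-window running minimum: a transient current spike raises only the instantaneous reading, never the window minimum. A minimal sketch (the window size is illustrative, not from the notes):

```python
from collections import deque

def running_min(samples, window=50):
    """Running minimum over a sliding window. A short random current
    spike only affects the newest reading, so the window minimum (and
    hence the detector) ignores it."""
    out = []
    q = deque(maxlen=window)
    for s in samples:
        q.append(s)
        out.append(min(q))
    return out
```
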
  • I2C bandwidth usage should be very doable

    • Raspberry Pi I2C supports fast mode at 400 Kb/s
    • INA3221 supports fast mode (400 Kb/s) and high-speed mode up to 2.44 Mb/s
    • At 1 measurement per millisecond, with about 7-8 bytes per channel, at most 64 Kb/s is needed
    • That is 16% of a 400 Kb/s bus
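
The budget works out as a two-line check (assuming one 8-byte sample per millisecond on a 400 Kb/s bus, per the figures above):

```python
# I2C bandwidth budget (assumptions: 8 bytes/sample, 1 sample per ms, 400 Kb/s bus)
bytes_per_sample = 8
samples_per_sec = 1000
bits_needed = bytes_per_sample * samples_per_sec * 8  # 64,000 bit/s = 64 Kb/s
bus_bits = 400_000
share = bits_needed / bus_bits                        # fraction of the bus used
```
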
  • 24 × 1 hr runs, because storage space on the Pi is limited

    • Collected 960 hrs of data in total, roughly 40 days of tests
    • 0.0015 false positive rate
    • 0 false negative rate
    • MTTF (mean time to false positive) of 22 hours
  • One false positive every 22 hours -> ~0.002 FP rate per 3-minute window
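
Converting "one false positive every 22 hours" into a per-window rate is just counting windows, assuming the 3-minute detection windows used elsewhere in these notes:

```python
# One false positive every 22 hours, with 3-minute detection windows
windows_per_hour = 60 // 3             # 20 windows per hour
fp_rate = 1 / (22 * windows_per_hour)  # per-window false-positive rate, ~0.0023
```
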

  • Two models for detection? One for idle and one for workloads
    • [x] Heuristic: running average over 3 minutes of data, 1% FP rate

    • Model trained only on idle works much better for idle - 0.097 mean absolute error
    • Model trained on both workload and idle doesn’t do too well on just idle data
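
A minimal sketch of the running-average heuristic above: flag a sample when it exceeds the trailing average by more than a threshold. The window and threshold values here are illustrative, not from the notes:

```python
from collections import deque

def running_avg_detector(samples, window, threshold):
    """Flag a sample when it exceeds the running average of the
    previous `window` samples by more than `threshold`."""
    q = deque(maxlen=window)
    flags = []
    for s in samples:
        avg = sum(q) / len(q) if q else s  # no history yet: compare to itself
        flags.append(s - avg > threshold)
        q.append(s)
    return flags
```
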
  • Forced quiescence - similar approach to forced context switches

    • Detect natural idle periods and reset the timer then
    • Very useful for Cryptosat workloads
  • Measure runtime/current draw/CPU before/after using system

    • Consistently negligible performance impact on SPICE benchmarks so far
  • see if the model can minimize the number of forced cooldowns

    • on idle: 10% FP rate per 3-minute window
    • on typical workload: 5% FP rate per 3-minute window
  • test baseline of forced cooldown every 2 mins

  • look at current draw of writes with O_DIRECT

  • test adding a latchup and seeing how model behaves

  • inject the baseline and see if we can do that

  • look at thrashing detection/other OS metrics

    • look at flamegraph of syscalls during thrashing
    • iostat: check whether the performance impact is significant
  • wider detection window

  • graph of estimate vs real current

  • try building linear model

    • take moving average of input values
      • training data - sample more frequently
      • kalman filter?
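
A trailing moving average over the sampled counters, as suggested above, could look like this (window size illustrative):

```python
def moving_average(xs, k):
    """Smooth noisy counter samples with a trailing window of size k
    before feeding them to the linear model."""
    out = []
    for i in range(len(xs)):
        w = xs[max(0, i - k + 1): i + 1]  # shorter window at the start
        out.append(sum(w) / len(w))
    return out
```

A Kalman filter would be the more principled alternative if the noise model is known; this is only the simple baseline.
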
  • time series validation - estimated/real current

    • validate on a second run of the Pi
  • normalize features between 0/1 and then put into model

    • outliers would screw up predictor in the future
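
One way to normalize features into [0, 1] while keeping future outliers from rescaling everything is to fix the range from the training data and clip out-of-range values. A sketch; the parameter handling is an assumption:

```python
def minmax_normalize(xs, lo=None, hi=None):
    """Scale features into [0, 1]. Fixing lo/hi from the training range
    (rather than per batch) stops a future outlier from distorting the
    scale; values outside the range are clipped."""
    lo = min(xs) if lo is None else lo
    hi = max(xs) if hi is None else hi
    span = hi - lo or 1.0  # avoid division by zero for constant features
    return [min(1.0, max(0.0, (x - lo) / span)) for x in xs]
```
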
  • look at how ratios affect

    • normalize against CPU
    • normalize to min/max
  • see if we can build a predictor just on CPU frequency

    • see if predictor can be applied against memory
    • linear regression on CPU/memory utilization
    • check numactl single core to see if frequency matters
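
A CPU-frequency-only predictor is one-variable least squares; the sample points below are made up purely for illustration:

```python
def fit_line(x, y):
    """Ordinary least squares with a single feature, e.g. predicting
    current draw from CPU frequency alone (names are illustrative)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx  # slope, intercept

# Hypothetical (frequency MHz, current A) pairs
slope, intercept = fit_line([600, 1200, 1800], [0.3, 0.5, 0.7])
```

The CPU + memory utilization version would be ordinary multiple regression (e.g. `numpy.linalg.lstsq`) over both features.
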
  • page faults

  • PCA on results

  • Pin memory bandwidth test to one core, and then look at how it performs

  • add HW cache events

    • instrument read/write?
  • performance impact of logging program

    • tested against timed FFTW benchmarks, zero impact due to low sampling rate
  • false positives?

  • look into other hardware counters, scheduler?

  • Tom Anderson's student on Treehouse: trying to predict energy consumption from performance counter events

  • look into prior energy-profiling work for the Raspberry Pi - predicting battery consumption of phone apps - maybe 10-15 years ago

    • write up how the integral doesn't directly translate (previous work targets battery consumption, not latchup detection)
    • scheduler optimizing for battery
  • power capping in datacenters - Microsoft tries to cap the power of a datacenter rack (bin-capping)
