The performance of buildings participating in demand response (DR) programs is usually evaluated with baseline models, which predict what electric demand would have been if a DR event had not been called. Different baseline models produce different results. Moreover, modelers implementing the same baseline model often make different implementation choices, which also lead to different results. Using real data from a DR program in California and a regression-based baseline model that relates building demand to time of week, outdoor air temperature, and building operational mode, we analyze the effect of model implementation choices on DR shed estimates. Results indicate strong sensitivities to the outdoor air temperature data source and to bad-data filtering methods, with standard deviations of differences in shed estimates of ≈20–30 kW. Sensitivities to demand/temperature data resolution, data alignment, and methods for determining when buildings are occupied are weaker, with standard deviations of differences in shed estimates of ≈2–5 kW.
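As a rough illustration of the kind of baseline described above, the following minimal sketch fits a regression with one indicator per time-of-week bin and one temperature slope per operational mode, then estimates the shed as the mean difference between predicted baseline and actual demand over the event. The abstract does not specify the functional form, so the linear temperature terms, the hourly 168-bin week, the function names, and the synthetic data are all assumptions made for illustration only; no real program data is used.

```python
import numpy as np

def design_matrix(tow, temp, occupied, n_tow):
    """Indicator column per time-of-week bin plus one temperature slope per mode.

    Assumed form (not taken from the source): linear in temperature,
    with separate slopes for occupied and unoccupied operation.
    """
    tow = np.asarray(tow)
    temp = np.asarray(temp, dtype=float)
    occupied = np.asarray(occupied, dtype=float)
    indicators = np.zeros((tow.size, n_tow))
    indicators[np.arange(tow.size), tow] = 1.0
    return np.column_stack([indicators, temp * occupied, temp * (1.0 - occupied)])

def fit_baseline(tow, temp, occupied, demand, n_tow):
    """Ordinary least squares fit on non-event (training) data."""
    X = design_matrix(tow, temp, occupied, n_tow)
    coef, *_ = np.linalg.lstsq(X, np.asarray(demand, dtype=float), rcond=None)
    return coef

def shed_estimate(coef, tow, temp, occupied, actual_demand, n_tow):
    """Mean of (predicted baseline - actual demand) over the event, in kW."""
    baseline = design_matrix(tow, temp, occupied, n_tow) @ coef
    return float(np.mean(baseline - np.asarray(actual_demand, dtype=float)))

# --- Synthetic example (all numbers are made up) ---
rng = np.random.default_rng(0)
n_tow = 168                      # hourly bins over one week (assumption)
t = np.arange(8 * n_tow)         # eight weeks of hourly training data
tow = t % n_tow
hour, day = tow % 24, tow // 24
occupied = ((day < 5) & (hour >= 8) & (hour < 18)).astype(float)
temp = 20 + 8 * np.sin(2 * np.pi * (hour - 9) / 24) + rng.normal(0, 2, t.size)
alpha = rng.uniform(40, 80, n_tow)           # true time-of-week load, kW
demand = alpha[tow] + 3.0 * temp * occupied + 1.0 * temp * (1 - occupied)

coef = fit_baseline(tow, temp, occupied, demand, n_tow)

# Hypothetical event: four occupied afternoon hours with a 25 kW true shed.
event_tow = np.array([14, 15, 16, 17])
event_temp = np.array([33.0, 34.0, 34.0, 32.0])
event_occ = np.ones(4)
actual = alpha[event_tow] + 3.0 * event_temp - 25.0

shed = shed_estimate(coef, event_tow, event_temp, event_occ, actual, n_tow)
print(f"estimated shed: {shed:.1f} kW")
```

Because the synthetic demand is noise-free and generated from the same form as the fitted model, the estimate should match the constructed shed closely. Swapping in a different temperature series, resolution, or occupancy rule at the `fit_baseline` step is one way to probe the sensitivities discussed above.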