Balancing Velocity vs Long-Term Scalability: Engineering Philosophy from a Startup
Software Engineer
This post was written with Claude Code, distilled from my engineering journal and experiences over the years at Hotelzify.
Introduction
Every startup engineering team faces a fundamental tension: ship fast to validate product-market fit, or build robust systems for long-term scale?
The answer isn't binary. At Hotelzify, we developed a philosophy that balances both. This post shares our approach, decision-making frameworks, and lessons from optimizing a hotel booking platform.
For technical implementation details and specific optimizations, see: Critical Engineering Optimizations
The Velocity vs Scalability Dilemma
The Problem: You can't optimize for both simultaneously at every stage.
Common Pitfalls:
Ship too fast → technical debt compounds, rewrites needed
Optimize too early → slow validation, missed market opportunities
Our Approach: Velocity first, optimization second, with a clear path to scale.
Our Development Philosophy
1. Functional Correctness First
Get business logic right before optimizing performance. A fast feature that doesn't meet requirements is worthless.
Example: When implementing member rate pricing for Google Hotel Center, we reused existing promotion lookup code. It was functionally correct but slow (150 database queries per batch). We shipped it, validated it worked correctly, then optimized.
Lesson: Functional correctness → production validation → optimization based on real data.
2. Incremental Optimization
Once a feature is validated, improve scalability based on actual usage patterns, not assumptions.
Real Case: Promotion lookup worked fine for occasional calls. When we extended it to batch operations (150 calls), performance became a bottleneck. We only discovered this through production metrics.
Lesson: When reusing code in new contexts, verify performance characteristics scale appropriately.
3. Measure Real Impact
Don't optimize prematurely. Ship, measure production performance, then optimize bottlenecks that actually matter.
Tools We Use:
EXPLAIN ANALYZE for query plans
Application-level timing (console.time)
Database profiling
Production metrics monitoring
Lesson: All our optimizations came after features were live with real performance data.
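A minimal sketch of application-level timing, assuming Node.js 16+ (where `performance` is a global). It uses `performance.now()` rather than `console.time` so the elapsed value is available as a number. The `timed` helper and the `loadPromotions` stand-in are hypothetical names for illustration, not code from our service.

```javascript
// Hypothetical helper: runs an async function and logs how long it took.
// Assumes a runtime with a global `performance` object (Node.js 16+ or a browser).
const timed = async (label, fn) => {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    // Runs whether fn resolved or threw, so failures are timed too.
    const elapsedMs = performance.now() - start;
    console.log(`${label}: ${elapsedMs.toFixed(1)}ms`);
  }
};

// Stand-in for a real database call; resolves after roughly 20ms.
const loadPromotions = () =>
  new Promise((resolve) => setTimeout(() => resolve(['PROMO10']), 20));

// Usage: wrap the call you suspect is slow.
timed('loadPromotions', loadPromotions).then((promos) => {
  console.log(`loaded ${promos.length} promotion(s)`);
});
```

In a real service the elapsed value would more likely go to a metrics system than to the console.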
Core Engineering Principles
1. Backward Compatibility is Non-Negotiable
Every change we make maintains full backward compatibility.
How:
Optional parameters with safe defaults
Graceful fallbacks for missing data
No breaking changes to existing APIs
Benefits:
Continuous deployment without coordination
No migration periods
Existing callers continue working
Example Pattern:
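const someFunction = async (
  requiredParam,
  optionalNewParam = null // Safe default: existing callers omit this argument
) => {
  // Check for null explicitly so valid falsy values (0, '') are not overwritten
  if (optionalNewParam === null) {
    // extractFromContext() is a placeholder for the pre-existing way of deriving the value
    optionalNewParam = extractFromContext();
  }
  // Function works with or without the new parameter
};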
2. Database as Configuration, Not Code
Configuration driven by database fields, not environment variables or code deployments.
Examples:
is_member_rate_ghc → controls strike-through pricing
mlos → controls length of stay restrictions
chainId → determines API token selection
Benefits:
Instant rollback (SQL UPDATE)
No code deployment for config changes
Per-entity control granularity
Clear audit trail
Real Impact: When implementing undocumented Google API features, database flags let us enable for 1 hotel, test, then gradually roll out. Rollback was a single SQL query.
3. Graceful Degradation
All integrations handle failures without breaking core functionality.
Pattern:
try {
  const result = await externalApiCall();
  return result;
} catch (error) {
  console.error('External API failed:', error);
  // Safe default: core functionality continues without the external data
  return null;
}
Philosophy: Degraded service > broken service.
Example: Multi-vendor token selection. Any error (missing hotelId, database error, invalid chainId) → default token. System continues working.
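The token fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not our production code: `selectToken`, `TOKENS_BY_CHAIN`, `DEFAULT_TOKEN`, and the `lookupChainId` callback are all hypothetical names.

```javascript
// Hypothetical token table and default; real values would come from
// configuration or the database.
const DEFAULT_TOKEN = 'default-token';
const TOKENS_BY_CHAIN = new Map([['chain-a', 'token-a']]);

// Resolves the API token for a hotel. `lookupChainId` is an assumed async
// function (for example a database query) that may throw or return nothing.
const selectToken = async (hotelId, lookupChainId) => {
  try {
    if (!hotelId) throw new Error('missing hotelId');
    const chainId = await lookupChainId(hotelId);
    const token = TOKENS_BY_CHAIN.get(chainId);
    if (!token) throw new Error(`invalid chainId: ${chainId}`);
    return token;
  } catch (error) {
    // Degrade instead of failing: log the cause and fall back to the default.
    console.error('Token selection failed, using default:', error.message);
    return DEFAULT_TOKEN;
  }
};

// Usage: a failing lookup still yields a usable token.
selectToken(42, async () => { throw new Error('db down'); })
  .then((token) => console.log(token));
```

Every failure path funnels into one catch block, so a new failure mode cannot accidentally bypass the default.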
4. Optimize in Layers
Apply optimizations incrementally, not all at once.
Layered Approach:
| Layer | Effort | Impact | When |
|---|---|---|---|
| 1. ORM raw mode | Low | 30-50% | Always |
| 2. Indexes | Medium | 20-40% | EXPLAIN shows scans |
| 3. NULL optimization | Medium | 10-20% | NULL checks dominate |
| 4. Raw SQL | High | 40-60% | Critical paths (<20ms) |
Decision: Start Layer 1 (always beneficial). Add layers only if profiling shows bottleneck and impact justifies maintenance cost.
Benefit: Layers stack multiplicatively (raw mode + indexes = 50-70%).
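To make "multiplicatively" concrete: if each layer removes a fraction of whatever time remains after the previous one, the combined reduction is 1 minus the product of the remaining fractions. The sketch below assumes the layers are independent and act on the same query path, which real workloads only approximate. `combinedReduction` is a hypothetical helper.

```javascript
// Combines independent latency reductions given as fractions (0.3 = 30%).
// Each layer removes a share of whatever time is left after the previous one.
const combinedReduction = (reductions) =>
  1 - reductions.reduce((remaining, r) => remaining * (1 - r), 1);

// Ranges from the table: ORM raw mode 30-50%, indexes 20-40%.
const low = combinedReduction([0.3, 0.2]);  // 1 - 0.7 * 0.8
const high = combinedReduction([0.5, 0.4]); // 1 - 0.5 * 0.6

console.log(`${Math.round(low * 100)}% to ${Math.round(high * 100)}%`); // 44% to 70%
```

Taking the table's ranges at face value gives roughly 44-70%, in line with the 50-70% we observed.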
5. Plan Rollback From Day One
Every feature includes a rollback strategy.
Mechanisms:
Database flags (toggle instantly)
Environment variables (remove and restart)
Code changes (revert commits)
Requirement: No permanent side effects that can't be undone.
Result: Confident production deployments. We can enable a feature for 1 entity, monitor, and rollback in seconds if needed.
Decision-Making Frameworks
When to Optimize
Red Flags (Don't Optimize):
Premature (no measurement showing problem)
Wrong bottleneck (optimizing non-critical path)
Negligible impact (improvement too small)
High complexity (makes code unmaintainable)
Temporary problem (will resolve with other changes)
Green Lights (Do Optimize):
Measured impact (clear metrics)
User-facing (affects response time/throughput)
Scalability (problem grows with usage)
Maintainable solution (doesn't add significant complexity)
Safe rollback (can revert easily)
Trade-off Analysis Pattern
Every optimization involves trade-offs. Document what you're giving up.
Example Trade-offs We Made:
| What We Gave Up | What We Gained | Why Acceptable |
|---|---|---|
| 5-10 MB memory | 150x fewer queries per batch (about 75x faster) | Memory is cheap, DB load expensive |
| ORM type safety | 40-60% performance | Critical paths only, most code uses ORM |
| Simple JOINs | 90% less DB CPU | Same-datacenter, network not bottleneck |
| Smaller result sets (50 → 250 rows transferred, 5x) | 97% less CPU | Row size small (~1KB), worth it |
Framework: Explicitly document trade-offs to make informed decisions.
Batch + Cache vs Individual Queries
Use Batch + Cache When:
Dataset <50MB
Iterations >50
Static data during processing
Complex filtering expensive in SQL
Avoid When:
Dataset >100MB
Data changes mid-loop
Memory constrained
Few iterations (<10)
Key Insight: Network round-trip (1-5ms per query) adds up at high iteration counts: 150 sequential queries spend 150-750ms on round-trips alone, before any query execution time.
JOIN vs Separate Queries
Use Separate Queries When:
Complex aggregations (GROUP BY, JSON_AGG)
High concurrency needed
Lock duration impacts other queries
Same-datacenter deployment
Use JOIN When:
Simple joins without aggregation
Very small result sets
Cross-region database (high network latency)
Atomic consistency critical
Key Insight: Lock duration matters more than execution time. A single 35ms query that holds locks throughout is worse than two 6ms queries that each release their locks early.
Concurrency Impact:
JOIN (10 concurrent): 10 × 35ms = 350ms (serialized)
Separate (10 concurrent): ~12ms total (interleaved)
Phased Rollout Strategy
Complex features benefit from multi-phase rollouts.
Case Study: MLOS Implementation
Phase 1: Database (Day 1)
- Add column, DEFAULT 1, nullable, indexed
- Zero impact on existing code
Phase 2: Service Layer (Day 2-3)
- Update CRUD operations
- Default to 1 if not provided
- Existing callers still work
Phase 3: API Response (Day 4-5)
- Include in responses
- Frontend can display
- No enforcement yet
Phase 4: Google Integration (Day 6-10)
- Push to Google Hotel Center
- Feature fully live
Benefits:
Each phase independently testable
No breaking changes
Incremental value (Phase 3 valuable without Phase 4)
Can stop at any phase
Philosophy: Enables continuous deployment without cross-team coordination.
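As an illustration of the Phase 2 rule ("default to 1 if not provided"), a service-layer helper might look like the sketch below. `buildRatePlan` and the `name` field are hypothetical; only the `mlos` default of 1 comes from the rollout described above.

```javascript
// Hypothetical service-layer helper for Phase 2: `mlos` is optional on input
// and falls back to 1, so existing callers keep working unchanged.
const DEFAULT_MLOS = 1;

const buildRatePlan = (input) => ({
  name: input.name,
  // `??` only replaces null/undefined, so an explicit value is preserved.
  mlos: input.mlos ?? DEFAULT_MLOS,
});

console.log(buildRatePlan({ name: 'Standard' }));         // existing caller, mlos defaults to 1
console.log(buildRatePlan({ name: 'Weekend', mlos: 2 })); // new caller, mlos stays 2
```

Because the default lives in one place, later phases can read `mlos` without checking whether the caller supplied it.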
Lessons from Real Optimizations
1. Hidden Performance Characteristics
What Happened: A promotion lookup function was built for single calls. We reused it in a batch operation (150 calls). The database call was buried inside a utility method, so the repeated queries weren't obvious at the call site.
Impact: 150 database queries per batch, 7-15 seconds execution time.
Solution: Batch fetch once + in-memory filtering → 1 query, 0.2 seconds.
Lesson: When reusing code, always consider performance characteristics in new contexts.
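The "batch fetch once + in-memory filtering" fix can be sketched as below. The row shape and the function names (`fetchAllPromotions`, `buildPromotionIndex`, `promotionsForHotels`) are assumptions made for the example; the real query and filtering rules are more involved.

```javascript
// Hypothetical promotion rows, standing in for the result of ONE batch query.
const fetchAllPromotions = async () => [
  { hotelId: 1, code: 'EARLY', active: true },
  { hotelId: 1, code: 'OLD', active: false },
  { hotelId: 2, code: 'MEMBER', active: true },
];

// Fetch once, filter in memory, and index the rows by hotelId.
const buildPromotionIndex = async () => {
  const index = new Map();
  for (const promo of await fetchAllPromotions()) {
    if (!promo.active) continue; // filtering that used to run per query
    const list = index.get(promo.hotelId) ?? [];
    list.push(promo);
    index.set(promo.hotelId, list);
  }
  return index;
};

// The batch loop does a Map lookup per hotel instead of a query per hotel.
const promotionsForHotels = async (hotelIds) => {
  const index = await buildPromotionIndex();
  return hotelIds.map((id) => ({ hotelId: id, promotions: index.get(id) ?? [] }));
};

promotionsForHotels([1, 2, 3]).then((rows) => console.log(JSON.stringify(rows)));
```

The index is built per batch and then discarded, which matches the "static data during processing" condition: nothing is cached across batches.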
2. Lock Duration vs Execution Time
Discovery: A 35ms JOIN query was causing concurrency issues despite being "fast".
Analysis: Query held locks on multiple tables. With 10 concurrent operations, serialization created 350ms bottleneck.
Solution: Two separate queries (6ms each) with independent lock release → 12ms total, interleaved execution.
Lesson: Lock duration matters more than execution time. Shorter locks improve concurrency.
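Splitting a JOIN into two queries moves the join into application code. The sketch below shows that merge step on two hypothetical result sets; `hotelRows`, `rateRows`, and `attachRates` are illustrative names, and the two arrays stand in for the results of the two separate queries.

```javascript
// Hypothetical results of two independent queries (one per table).
const hotelRows = [
  { id: 1, name: 'Sea View' },
  { id: 2, name: 'Hill Top' },
];
const rateRows = [
  { hotelId: 1, price: 120 },
  { hotelId: 1, price: 150 },
  { hotelId: 2, price: 90 },
];

// Joins the two result sets in memory, replacing a JOIN + aggregation in SQL.
const attachRates = (hotels, rates) => {
  const pricesByHotel = new Map();
  for (const rate of rates) {
    const list = pricesByHotel.get(rate.hotelId) ?? [];
    list.push(rate.price);
    pricesByHotel.set(rate.hotelId, list);
  }
  return hotels.map((hotel) => ({
    ...hotel,
    prices: pricesByHotel.get(hotel.id) ?? [],
  }));
};

console.log(JSON.stringify(attachRates(hotelRows, rateRows)));
```

The trade-off is that the two reads are no longer atomic: a write landing between them can produce a slightly inconsistent view, which is why atomic consistency is listed as a reason to keep the JOIN.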
3. In-Memory Can Beat Database
Conventional Wisdom: Database is optimized for aggregations, use it.
Reality: For small datasets (hundreds of rows), filtering in JavaScript after a single fetch can be faster than issuing repeated database queries.
Why:
No network round-trip (1-5ms per query)
Modern JavaScript engines highly optimized
No lock contention
Simpler queries execute faster
Result: 150 queries → 1 query, 99% less database load.
Lesson: Question conventional wisdom with measurement.
4. Database Flags for Instant Rollback
Challenge: Implementing undocumented Google API (RateModifications).
Risk: Could break anytime, no official support.
Solution: Single database field: is_member_rate_enabled
TRUE → send rate modification
FALSE → skip
Rollout:
Phase 1: 1 hotel (test)
Phase 2: 10 hotels (monitor)
Phase 3: All hotels (if stable)
Rollback: One SQL UPDATE, instant effect.
Lesson: For undocumented/beta APIs, design for instant rollback without code deployment.
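A rough sketch of the flag check, assuming hotel rows are already loaded with the `is_member_rate_enabled` column. The row shape and `shouldSendRateModification` are hypothetical; the important design choice is that anything other than an explicit TRUE means skip.

```javascript
// Hypothetical hotel rows as they might come back from the database.
const flaggedHotels = [
  { id: 1, is_member_rate_enabled: true },
  { id: 2, is_member_rate_enabled: false },
  { id: 3 }, // flag missing or NULL: treated as disabled
];

// Only an explicit TRUE enables the feature, so a missing value fails closed.
const shouldSendRateModification = (hotel) =>
  hotel.is_member_rate_enabled === true;

const hotelsToPush = flaggedHotels
  .filter(shouldSendRateModification)
  .map((hotel) => hotel.id);

console.log(hotelsToPush); // only hotel 1 is included
```

Since the flag is read on every push, flipping it in the database takes effect on the next run with no deployment.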
5. Technical Debt in Queries
Discovery: A reporting query had 7 JOINs, only 3 were actually used.
Root Cause: Features were removed/refactored, JOINs remained.
Audit Process:
List all SELECT columns
Trace to source tables
Identify unused tables
Remove safely
Impact: Simpler query, faster execution, easier maintenance.
Lesson: Technical debt accumulates in queries too. Regular audits prevent degradation.
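The audit steps can be partly mechanized. The sketch below is a simplification: it assumes you have already listed the joined table aliases and every `alias.column` reference by hand (no SQL parsing), and `findUnusedJoins` is a hypothetical helper. The aliases and columns are made up for the example.

```javascript
// Returns the joined aliases that no referenced column points at.
// `referencedColumns` should cover SELECT, WHERE, GROUP BY and ORDER BY,
// each written as 'alias.column'.
const findUnusedJoins = (joinedAliases, referencedColumns) => {
  const used = new Set(referencedColumns.map((col) => col.split('.')[0]));
  return joinedAliases.filter((alias) => !used.has(alias));
};

// Hypothetical reporting query: seven joined aliases, columns from three.
const joinedAliases = ['h', 'b', 'r', 'p', 'c', 'u', 'i'];
const referencedColumns = ['h.name', 'b.total', 'b.created_at', 'r.price'];

console.log(findUnusedJoins(joinedAliases, referencedColumns)); // p, c, u, i
```

An unused alias is only a candidate for removal. An INNER JOIN can still filter or duplicate rows even when none of its columns are referenced, so compare query results before and after removing it.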
Anti-Patterns We Avoided
1. Premature Abstraction
We didn't create a generic "optimization framework" upfront. Each optimization was specific to its problem. Abstractions emerged naturally after multiple similar patterns.
Why: Early abstractions often miss the mark. Let patterns emerge from real use cases.
2. Over-Engineering
We didn't add features "just in case". The MLOS implementation added exactly what was needed in each phase, nothing more.
Why: YAGNI (You Aren't Gonna Need It). Build for current needs, not hypothetical futures.
3. Ignoring Trade-offs
Explicitly documented what we gave up for each optimization. No solution is perfect; knowing trade-offs helps make informed decisions.
Why: Honest assessment prevents regret later. "We chose X knowing we gave up Y" is better than "Why is Y broken?"
4. Optimizing Without Measuring
Every optimization started with metrics. No guessing about bottlenecks.
Why: Intuition about performance is often wrong. Measurement reveals truth.
The Compound Effect
Individual optimizations stack:
Database Optimizations:
Batch + cache: 150x reduction
Separate queries: 90% less DB CPU
ORM raw mode + indexes: 50-70% faster
System Impact:
Price push: 15s → 0.2s (75x faster)
Database CPU: 90% reduction
Concurrent throughput: Improved via shorter locks
Development Velocity:
MLOS: 4 phases in 2 weeks
Strike-through pricing: Instant rollout control
Multi-vendor: 17 functions, zero breaking changes
Philosophy Enabled This: Velocity first → ship and validate → measure → optimize → maintain backward compatibility → compound improvements.
Practical Takeaways
For Early-Stage Startups
Velocity First: Get to product-market fit before optimizing
Functional Correctness: Validate business logic before performance
Measure Real Usage: Production data reveals actual bottlenecks
Plan Rollback: Database flags enable confident experimentation
For Growing Startups
Optimize Strategically: Fix real bottlenecks, not hypothetical ones
Backward Compatibility: Enables continuous deployment
Phased Rollouts: De-risk complex features
Document Trade-offs: Inform future decisions
For Scale-ups
Lock Duration: Matters more than execution time
In-Memory Processing: Can beat database for small datasets
Technical Debt Audits: Prevent query degradation
Layered Optimization: Incremental improvements stack
Conclusion
Balancing velocity and scalability isn't about choosing one over the other. It's about:
Sequence: Velocity first, optimization second
Measurement: Real data, not assumptions
Compatibility: Never break existing functionality
Rollback: Plan for instant reversal
Trade-offs: Document what you're giving up
Iteration: Small improvements compound over time
This philosophy enabled us to:
Ship features quickly (validate product-market fit)
Optimize based on real usage (no premature optimization)
Maintain system stability (backward compatibility)
Scale confidently (measured improvements)
Performance optimization is a journey, not a destination. Start with measurement, make targeted improvements, measure impact, repeat.
Discussion
How does your team balance velocity and scalability? What frameworks do you use for deciding when to optimize? Share your experiences in the comments.
For technical details on specific optimizations, see: Critical Engineering Optimizations
Questions? Happy to discuss in the comments.