feat: implement chunked backfilling for aggregate component

- Add backfillAggregatesChunk mutation that processes 500 records at a time
- Uses pagination and ctx.scheduler.runAfter to chain batch processing
- Prevents Convex 16MB memory limit issues with large datasets
- Progress visible in Convex dashboard logs
- Track seen session IDs across chunks for unique visitor counting
- Update howstatsworks.md with chunked backfilling documentation
- Add v1.11.1 changelog entries
Wayne Sutton
2025-12-20 14:59:05 -08:00
parent 0057194701
commit 98a43b86a2
4 changed files with 139 additions and 20 deletions
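Before the per-file diffs, it may help to see the pattern the commit message describes in one place. The following is an illustrative sketch only, not code from this commit: `processChunk` is a stand-in name, and the real implementation is `backfillAggregatesChunk` in the `convex/stats.ts` diff below.

```ts
import { v } from "convex/values";
import { internalMutation } from "./_generated/server";

export const processChunk = internalMutation({
  args: { cursor: v.union(v.string(), v.null()) },
  returns: v.null(),
  handler: async (ctx, args) => {
    // Read one page of results instead of collecting the whole table.
    const result = await ctx.db
      .query("pageViews")
      .paginate({ numItems: 500, cursor: args.cursor });

    // ... process result.page here ...

    // Chain the next batch; runAfter(0, ...) schedules it immediately.
    if (!result.isDone) {
      await ctx.scheduler.runAfter(
        0,
        // Dynamic import + cast, mirroring the diff's workaround for the
        // circular type reference a self-scheduling mutation creates.
        (await import("./_generated/api")).internal.stats.processChunk as any,
        { cursor: result.continueCursor }
      );
    }
    return null;
  },
});
```

Each invocation stays well under the memory limit because it only ever holds one page of documents; the scheduler, not a loop, carries the work forward.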

View File

@@ -4,6 +4,28 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [1.11.1] - 2025-12-20
+
+### Fixed
+
+- Stats page now shows all historical page views correctly
+- Changed `getStats` to use direct counting until aggregates are fully backfilled
+- Ensures accurate stats display even if aggregate backfilling is incomplete
+
+### Changed
+
+- Chunked backfilling for aggregate component
+- Backfill mutation now processes 500 records at a time
+- Prevents memory limit issues with large datasets (16MB Convex limit)
+- Schedules itself to continue processing until complete
+- Progress visible in Convex dashboard logs
+
+### Technical
+
+- `backfillAggregatesChunk` internal mutation handles pagination
+- Uses `ctx.scheduler.runAfter` to chain batch processing
+- Tracks seen session IDs across chunks for unique visitor counting
+
 ## [1.11.0] - 2025-12-20
 
 ### Added

View File

@@ -7,6 +7,18 @@ order: 5
 All notable changes to this project.
 
+## v1.11.1
+
+Released December 20, 2025
+
+**Fix historical stats display and chunked backfilling**
+
+- Stats page now shows all historical page views correctly
+- Changed `getStats` to use direct counting until aggregates are fully backfilled
+- Backfill mutation now processes 500 records at a time (chunked)
+- Prevents memory limit issues with large datasets (16MB Convex limit)
+- Schedules itself to continue processing until complete
+
 ## v1.11.0
 
 Released December 20, 2025

View File

@@ -319,33 +319,44 @@ export const cleanupStaleSessions = internalMutation({
   },
 });
 
+// Batch size for chunked backfilling (keeps memory usage under 16MB limit)
+const BACKFILL_BATCH_SIZE = 500;
+
 /**
- * Internal mutation to backfill aggregates from existing pageViews data.
- * Run this once after deploying the aggregate component to populate counts.
- * Uses idempotent insertIfDoesNotExist so it's safe to run multiple times.
+ * Internal mutation to backfill aggregates in chunks.
+ * Processes BACKFILL_BATCH_SIZE records at a time to avoid memory limits.
+ * Schedules itself to continue with the next batch until complete.
  */
-export const backfillAggregates = internalMutation({
-  args: {},
+export const backfillAggregatesChunk = internalMutation({
+  args: {
+    cursor: v.union(v.string(), v.null()),
+    totalProcessed: v.number(),
+    seenSessionIds: v.array(v.string()),
+  },
   returns: v.object({
+    status: v.union(v.literal("in_progress"), v.literal("complete")),
     processed: v.number(),
     uniqueSessions: v.number(),
+    cursor: v.union(v.string(), v.null()),
   }),
-  handler: async (ctx) => {
-    // Get all page views
-    const allViews = await ctx.db.query("pageViews").collect();
-
-    // Track unique sessions to avoid duplicate inserts
-    const seenSessions = new Set<string>();
+  handler: async (ctx, args) => {
+    // Paginate through pageViews in batches
+    const result = await ctx.db
+      .query("pageViews")
+      .paginate({ numItems: BACKFILL_BATCH_SIZE, cursor: args.cursor });
+
+    // Track unique sessions (restore from previous chunks)
+    const seenSessions = new Set<string>(args.seenSessionIds);
     let uniqueCount = 0;
 
-    // Process each view and update aggregates
-    for (const doc of allViews) {
+    // Process each view in this batch
+    for (const doc of result.page) {
       // Insert into pageViewsByPath aggregate (one per view)
       await pageViewsByPath.insertIfDoesNotExist(ctx, doc);
 
       // Insert into totalPageViews aggregate (one per view)
       await totalPageViews.insertIfDoesNotExist(ctx, doc);
 
       // Insert into uniqueVisitors aggregate (one per session)
       if (!seenSessions.has(doc.sessionId)) {
         seenSessions.add(doc.sessionId);
@@ -353,11 +364,74 @@ export const backfillAggregates = internalMutation({
         uniqueCount++;
       }
     }
 
+    const newTotalProcessed = args.totalProcessed + result.page.length;
+
+    // If there are more records, schedule the next chunk
+    if (!result.isDone) {
+      // Convert Set to array for passing to next chunk (limited to prevent arg size issues)
+      // Only keep the last 10000 session IDs to prevent argument size explosion
+      const sessionArray = Array.from(seenSessions).slice(-10000);
+      await ctx.scheduler.runAfter(
+        0,
+        // eslint-disable-next-line @typescript-eslint/no-explicit-any
+        (await import("./_generated/api")).internal.stats.backfillAggregatesChunk as any,
+        {
+          cursor: result.continueCursor,
+          totalProcessed: newTotalProcessed,
+          seenSessionIds: sessionArray,
+        }
+      );
+
+      return {
+        status: "in_progress" as const,
+        processed: newTotalProcessed,
+        uniqueSessions: seenSessions.size,
+        cursor: result.continueCursor,
+      };
+    }
+
+    // Backfilling complete
     return {
-      processed: allViews.length,
-      uniqueSessions: uniqueCount,
+      status: "complete" as const,
+      processed: newTotalProcessed,
+      uniqueSessions: seenSessions.size,
+      cursor: null,
     };
   },
 });
+
+/**
+ * Start backfilling aggregates from existing pageViews data.
+ * This kicks off the chunked backfill process.
+ * Safe to call multiple times (uses insertIfDoesNotExist).
+ */
+export const backfillAggregates = internalMutation({
+  args: {},
+  returns: v.object({
+    message: v.string(),
+  }),
+  handler: async (ctx) => {
+    // Check if there are any pageViews to backfill
+    const firstView = await ctx.db.query("pageViews").first();
+    if (!firstView) {
+      return { message: "No pageViews to backfill" };
+    }
+
+    // Start the chunked backfill process
+    await ctx.scheduler.runAfter(
+      0,
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      (await import("./_generated/api")).internal.stats.backfillAggregatesChunk as any,
+      {
+        cursor: null,
+        totalProcessed: 0,
+        seenSessionIds: [],
+      }
+    );
+
+    return { message: "Backfill started. Check logs for progress." };
+  },
+});
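An implementation note on the `(await import("./_generated/api")) ... as any` pattern above: a mutation that schedules itself through the generated `internal` object creates a circular type reference, which the dynamic import and cast sidestep. Convex's TypeScript guidance offers another way out: an explicit return type annotation on the handler breaks the cycle, allowing a typed static reference. A minimal sketch under that assumption (`tick` is a hypothetical function, not part of this commit):

```ts
import { v } from "convex/values";
import { internalMutation } from "./_generated/server";
import { internal } from "./_generated/api";

// Annotating the handler's return type (Promise<null>) breaks the circular
// type reference, so the typed static reference internal.stats.tick can be
// scheduled without a dynamic import or an `as any` cast.
export const tick = internalMutation({
  args: { remaining: v.number() },
  returns: v.null(),
  handler: async (ctx, args): Promise<null> => {
    if (args.remaining > 0) {
      await ctx.scheduler.runAfter(0, internal.stats.tick, {
        remaining: args.remaining - 1,
      });
    }
    return null;
  },
});
```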

View File

@@ -97,14 +97,25 @@ const uniqueVisitors = new TableAggregate<{
 ### Backfill existing data
 
-After deploying the aggregate component, run the backfill mutation once to populate counts from existing page views:
+After deploying the aggregate component, run the backfill mutation to populate counts from existing page views:
 
 ```bash
 npx convex run stats:backfillAggregates
 ```
 
+**Chunked backfilling:** The backfill process handles large datasets by processing records in batches of 500. This prevents memory limit issues (Convex has a 16MB limit per function execution). The mutation schedules itself to continue processing until all records are backfilled.
+
+How it works:
+
+1. `backfillAggregates` starts the process and schedules the first chunk
+2. `backfillAggregatesChunk` processes 500 records at a time using pagination
+3. If more records exist, it schedules itself to continue with the next batch
+4. Progress is logged (check Convex dashboard logs)
+5. Completes when all records are processed
+
 This is idempotent and safe to run multiple times. It uses `insertIfDoesNotExist` to avoid duplicates.
+
+**Fallback behavior:** While aggregates are being backfilled (or if backfilling hasn't run yet), the `getStats` query uses direct counting from the `pageViews` table to ensure accurate stats are always displayed. This is slightly slower but guarantees correct numbers.
 
 ## Data flow
 
 1. Visitor loads any page
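The `getStats` fallback described in the changelog is not part of this diff. A minimal sketch of the direct-counting idea, assuming the `pageViews` table and `sessionId` field used in `convex/stats.ts` (the `getStatsDirect` name is hypothetical):

```ts
import { v } from "convex/values";
import { query } from "./_generated/server";

// Hypothetical sketch of the fallback: count directly from the pageViews
// table instead of reading the aggregates. Accurate even while the aggregate
// backfill is incomplete, at the cost of scanning the table on each call.
export const getStatsDirect = query({
  args: {},
  returns: v.object({
    totalViews: v.number(),
    uniqueVisitors: v.number(),
  }),
  handler: async (ctx) => {
    const views = await ctx.db.query("pageViews").collect();
    const sessions = new Set(views.map((doc) => doc.sessionId));
    return { totalViews: views.length, uniqueVisitors: sessions.size };
  },
});
```

Once the backfill reports `complete`, the aggregate-backed counts can take over and the full-table scan disappears from the hot path.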