Data Chunking for Large Datasets

Process large datasets efficiently by breaking them into manageable chunks to reduce memory consumption and improve performance.

The Problem: Memory Exhaustion

// BAD: Loading all records into memory
$users = User::all(); // Could be millions of records!

foreach ($users as $user) {
    $user->sendNewsletter();
}

// BAD: Even with select, still loads everything
$emails = User::pluck('email'); // Array of millions of emails

foreach ($emails as $email) {
    Mail::to($email)->send(new Newsletter());
}

Solution: Chunking Methods

1. Basic Chunking with `chunk()`

// Process 100 records at a time
User::chunk(100, function ($users) {
    foreach ($users as $user) {
        $user->calculateStatistics();
        $user->save();
    }
});

// With conditions
User::where('active', true)
    ->chunk(200, function ($users) {
        foreach ($users as $user) {
            ProcessUserJob::dispatch($user);
        }
    });

2. Chunk By ID for Safer Updates

// Prevents issues when modifying records during iteration
User::where('newsletter_sent', false)
    ->chunkById(100, function ($users) {
        foreach ($users as $user) {
            $user->update(['newsletter_sent' => true]);
            Mail::to($user)->send(new Newsletter());
        }
    });

// With custom column
Payment::where('processed', false)
    ->chunkById(100, function ($payments) {
        foreach ($payments as $payment) {
            $payment->process();
        }
    }, 'payment_id'); // Custom ID column

3. Lazy Collections for Memory Efficiency

// Uses PHP generators, minimal memory footprint
User::where('created_at', '>=', now()->subDays(30))
    ->lazy()
    ->each(function ($user) {
        $user->recalculateScore();
    });

// With chunking size control
User::lazy(100)->each(function ($user) {
    ProcessRecentUser::dispatch($user);
});

// Filter and map with lazy collections
$results = User::lazy()
    ->filter(fn($user) => $user->hasActiveSubscription())
    ->map(fn($user) => [
        'id' => $user->id,
        'revenue' => $user->calculateRevenue(),
    ])
    ->take(1000);

4. Cursor for Forward-Only Iteration

// Most memory-efficient for simple forward iteration
foreach (User::where('active', true)->cursor() as $user) {
    $user->updateLastSeen();
}

// With lazy() for additional collection methods
User::where('verified', true)
    ->cursor()
    ->filter(fn($user) => $user->hasCompletedProfile())
    ->each(fn($user) => SendWelcomeEmail::dispatch($user));

Real-World Examples

Export Large CSV

class ExportUsersCommand extends Command
{
    public function handle()
    {
        $file = fopen('users.csv', 'w');

        // Write headers
        fputcsv($file, ['ID', 'Name', 'Email', 'Created At']);

        // Process in chunks to avoid memory issues
        User::select('id', 'name', 'email', 'created_at')
            ->chunkById(500, function ($users) use ($file) {
                foreach ($users as $user) {
                    fputcsv($file, [
                        $user->id,
                        $user->name,
                        $user->email,
                        $user->created_at->toDateTimeString(),
                    ]);
                }

                // Optional: Show progress
                $this->info("Processed up to ID: {$users->last()->id}");
            });

        fclose($file);
        $this->info('Export completed!');
    }
}

Batch Email Campaign

class SendCampaignJob implements ShouldQueue
{
    public function handle()
    {
        $campaign = Campaign::find($this->campaignId);

        // Process subscribers in chunks
        $campaign->subscribers()
            ->where('unsubscribed', false)
            ->chunkById(50, function ($subscribers) use ($campaign) {
                foreach ($subscribers as $subscriber) {
                    SendCampaignEmail::dispatch($campaign, $subscriber)
                        ->onQueue('emails')
                        ->delay(now()->addSeconds(rand(1, 10)));
                }

                // Prevent rate limiting
                sleep(2);
            });
    }
}

Data Migration/Transformation

class MigrateUserData extends Command
{
    public function handle()
    {
        $bar = $this->output->createProgressBar(User::count());

        User::with(['profile', 'settings'])
            ->chunkById(100, function ($users) use ($bar) {
                DB::transaction(function () use ($users, $bar) {
                    foreach ($users as $user) {
                        // Complex transformation
                        $newData = $this->transformUserData($user);

                        NewUserModel::create($newData);

                        $bar->advance();
                    }
                });
            });

        $bar->finish();
        $this->newLine();
        $this->info('Migration completed!');
    }
}

Cleanup Old Records

class CleanupOldLogs extends Command
{
    public function handle()
    {
        $deletedCount = 0;

        ActivityLog::where('created_at', '<', now()->subMonths(6))
            ->chunkById(1000, function ($logs) use (&$deletedCount) {
                $ids = $logs->pluck('id')->toArray();

                // Batch delete for efficiency
                ActivityLog::whereIn('id', $ids)->delete();

                $deletedCount += count($ids);

                $this->info("Deleted {$deletedCount} records so far...");

                // Give database a breather
                usleep(100000); // 100ms
            });

        $this->info("Total deleted: {$deletedCount}");
    }
}

Choosing the Right Method

| Method | Use Case | Memory Usage | Notes | | ------------- | ------------------------ | ---------------- | ---------------------------------------------- | | chunk() | General processing | Moderate | May skip/duplicate if modifying filter columns | | chunkById() | Updates during iteration | Moderate | Safer for modifications | | lazy() | Large result processing | Low | Returns LazyCollection | | cursor() | Simple forward iteration | Lowest | Returns Generator | | each() | Simple operations | High (loads all) | Avoid for large datasets |

Performance Optimization Tips

1. Select Only Needed Columns

User::select('id', 'email', 'name')
    ->chunkById(100, function ($users) {
        // Process with minimal data
    });

2. Use Indexes

// Ensure indexed columns in where clauses
User::where('status', 'active') // status should be indexed
    ->where('created_at', '>', $date) // created_at should be indexed
    ->chunkById(200, function ($users) {
        // Process efficiently
    });

3. Disable Eloquent Events When Appropriate

User::withoutEvents(function () {
    User::chunkById(500, function ($users) {
        foreach ($users as $user) {
            $user->update(['processed' => true]);
        }
    });
});

4. Use Raw Queries for Bulk Updates

// Instead of updating each record
User::chunkById(100, function ($users) {
    $ids = $users->pluck('id')->toArray();

    // Bulk update with raw query
    DB::table('users')
        ->whereIn('id', $ids)
        ->update([
            'last_processed_at' => now(),
            'processing_count' => DB::raw('processing_count + 1'),
        ]);
});

5. Queue Large Operations

class ProcessLargeDataset extends Command
{
    public function handle()
    {
        User::chunkById(100, function ($users) {
            ProcessUserBatch::dispatch($users->pluck('id'))
                ->onQueue('heavy-processing');
        });
    }
}

class ProcessUserBatch implements ShouldQueue
{
    public function __construct(
        public Collection $userIds
    ) {}

    public function handle()
    {
        User::whereIn('id', $this->userIds)
            ->get()
            ->each(fn($user) => $user->process());
    }
}

Testing Chunked Operations

test('processes all active users in chunks', function () {
    // Create test data
    User::factory()->count(150)->create(['active' => true]);
    User::factory()->count(50)->create(['active' => false]);

    $processed = [];

    User::where('active', true)
        ->chunkById(50, function ($users) use (&$processed) {
            foreach ($users as $user) {
                $processed[] = $user->id;
            }
        });

    expect($processed)->toHaveCount(150);
    expect(count(array_unique($processed)))->toBe(150);
});

test('handles empty datasets gracefully', function () {
    $callCount = 0;

    User::where('id', '<', 0) // No results
        ->chunk(100, function ($users) use (&$callCount) {
            $callCount++;
        });

    expect($callCount)->toBe(0);
});

Common Pitfalls

Modifying filter columns during chunk()

// WRONG: May skip records
User::where('processed', false)
    ->chunk(100, function ($users) {
        foreach ($users as $user) {
            $user->update(['processed' => true]); // Changes the WHERE condition!
        }
    });

// CORRECT: Use chunkById()
User::where('processed', false)
    ->chunkById(100, function ($users) {
        foreach ($users as $user) {
            $user->update(['processed' => true]);
        }
    });

Not handling chunk callback returns

// Return false to stop chunking
User::chunk(100, function ($users) {
    foreach ($users as $user) {
        if ($user->hasIssue()) {
            return false; // Stop processing
        }
        $user->process();
    }
});

Ignoring database connection limits

// Consider connection timeouts for long operations
DB::connection()->getPdo()->setAttribute(PDO::ATTR_TIMEOUT, 3600);

User::chunkById(100, function ($users) {
    // Long running process
});

Remember: When dealing with large datasets, always think about memory usage, query efficiency, and processing time. Chunk your data appropriately!

Data Chunking for Large Datasets

Process large datasets efficiently by breaking them into manageable chunks to reduce memory consumption and improve performance.

The Problem: Memory Exhaustion

// BAD: Loading all records into memory
$users = User::all(); // Could be millions of records!

foreach ($users as $user) {
    $user->sendNewsletter();
}

// BAD: Even with select, still loads everything
$emails = User::pluck('email'); // Array of millions of emails

foreach ($emails as $email) {
    Mail::to($email)->send(new Newsletter());
}

Solution: Chunking Methods

1. Basic Chunking with `chunk()`

// Process 100 records at a time
User::chunk(100, function ($users) {
    foreach ($users as $user) {
        $user->calculateStatistics();
        $user->save();
    }
});

// With conditions
User::where('active', true)
    ->chunk(200, function ($users) {
        foreach ($users as $user) {
            ProcessUserJob::dispatch($user);
        }
    });

2. Chunk By ID for Safer Updates

// Prevents issues when modifying records during iteration
User::where('newsletter_sent', false)
    ->chunkById(100, function ($users) {
        foreach ($users as $user) {
            $user->update(['newsletter_sent' => true]);
            Mail::to($user)->send(new Newsletter());
        }
    });

// With custom column
Payment::where('processed', false)
    ->chunkById(100, function ($payments) {
        foreach ($payments as $payment) {
            $payment->process();
        }
    }, 'payment_id'); // Custom ID column

3. Lazy Collections for Memory Efficiency

// Uses PHP generators, minimal memory footprint
User::where('created_at', '>=', now()->subDays(30))
    ->lazy()
    ->each(function ($user) {
        $user->recalculateScore();
    });

// With chunking size control
User::lazy(100)->each(function ($user) {
    ProcessRecentUser::dispatch($user);
});

// Filter and map with lazy collections
$results = User::lazy()
    ->filter(fn($user) => $user->hasActiveSubscription())
    ->map(fn($user) => [
        'id' => $user->id,
        'revenue' => $user->calculateRevenue(),
    ])
    ->take(1000);

4. Cursor for Forward-Only Iteration

// Most memory-efficient for simple forward iteration
foreach (User::where('active', true)->cursor() as $user) {
    $user->updateLastSeen();
}

// With lazy() for additional collection methods
User::where('verified', true)
    ->cursor()
    ->filter(fn($user) => $user->hasCompletedProfile())
    ->each(fn($user) => SendWelcomeEmail::dispatch($user));

Real-World Examples

Export Large CSV

class ExportUsersCommand extends Command
{
    public function handle()
    {
        $file = fopen('users.csv', 'w');

        // Write headers
        fputcsv($file, ['ID', 'Name', 'Email', 'Created At']);

        // Process in chunks to avoid memory issues
        User::select('id', 'name', 'email', 'created_at')
            ->chunkById(500, function ($users) use ($file) {
                foreach ($users as $user) {
                    fputcsv($file, [
                        $user->id,
                        $user->name,
                        $user->email,
                        $user->created_at->toDateTimeString(),
                    ]);
                }

                // Optional: Show progress
                $this->info("Processed up to ID: {$users->last()->id}");
            });

        fclose($file);
        $this->info('Export completed!');
    }
}

Batch Email Campaign

class SendCampaignJob implements ShouldQueue
{
    public function handle()
    {
        $campaign = Campaign::find($this->campaignId);

        // Process subscribers in chunks
        $campaign->subscribers()
            ->where('unsubscribed', false)
            ->chunkById(50, function ($subscribers) use ($campaign) {
                foreach ($subscribers as $subscriber) {
                    SendCampaignEmail::dispatch($campaign, $subscriber)
                        ->onQueue('emails')
                        ->delay(now()->addSeconds(rand(1, 10)));
                }

                // Prevent rate limiting
                sleep(2);
            });
    }
}

Data Migration/Transformation

class MigrateUserData extends Command
{
    public function handle()
    {
        $bar = $this->output->createProgressBar(User::count());

        User::with(['profile', 'settings'])
            ->chunkById(100, function ($users) use ($bar) {
                DB::transaction(function () use ($users, $bar) {
                    foreach ($users as $user) {
                        // Complex transformation
                        $newData = $this->transformUserData($user);

                        NewUserModel::create($newData);

                        $bar->advance();
                    }
                });
            });

        $bar->finish();
        $this->newLine();
        $this->info('Migration completed!');
    }
}

Cleanup Old Records

class CleanupOldLogs extends Command
{
    public function handle()
    {
        $deletedCount = 0;

        ActivityLog::where('created_at', '<', now()->subMonths(6))
            ->chunkById(1000, function ($logs) use (&$deletedCount) {
                $ids = $logs->pluck('id')->toArray();

                // Batch delete for efficiency
                ActivityLog::whereIn('id', $ids)->delete();

                $deletedCount += count($ids);

                $this->info("Deleted {$deletedCount} records so far...");

                // Give database a breather
                usleep(100000); // 100ms
            });

        $this->info("Total deleted: {$deletedCount}");
    }
}

Choosing the Right Method

Performance Optimization Tips

1. Select Only Needed Columns

User::select('id', 'email', 'name')
    ->chunkById(100, function ($users) {
        // Process with minimal data
    });

2. Use Indexes

// Ensure indexed columns in where clauses
User::where('status', 'active') // status should be indexed
    ->where('created_at', '>', $date) // created_at should be indexed
    ->chunkById(200, function ($users) {
        // Process efficiently
    });

3. Disable Eloquent Events When Appropriate

User::withoutEvents(function () {
    User::chunkById(500, function ($users) {
        foreach ($users as $user) {
            $user->update(['processed' => true]);
        }
    });
});

4. Use Raw Queries for Bulk Updates

// Instead of updating each record
User::chunkById(100, function ($users) {
    $ids = $users->pluck('id')->toArray();

    // Bulk update with raw query
    DB::table('users')
        ->whereIn('id', $ids)
        ->update([
            'last_processed_at' => now(),
            'processing_count' => DB::raw('processing_count + 1'),
        ]);
});

5. Queue Large Operations

class ProcessLargeDataset extends Command
{
    public function handle()
    {
        User::chunkById(100, function ($users) {
            ProcessUserBatch::dispatch($users->pluck('id'))
                ->onQueue('heavy-processing');
        });
    }
}

class ProcessUserBatch implements ShouldQueue
{
    public function __construct(
        public Collection $userIds
    ) {}

    public function handle()
    {
        User::whereIn('id', $this->userIds)
            ->get()
            ->each(fn($user) => $user->process());
    }
}

Testing Chunked Operations

test('processes all active users in chunks', function () {
    // Create test data
    User::factory()->count(150)->create(['active' => true]);
    User::factory()->count(50)->create(['active' => false]);

    $processed = [];

    User::where('active', true)
        ->chunkById(50, function ($users) use (&$processed) {
            foreach ($users as $user) {
                $processed[] = $user->id;
            }
        });

    expect($processed)->toHaveCount(150);
    expect(count(array_unique($processed)))->toBe(150);
});

test('handles empty datasets gracefully', function () {
    $callCount = 0;

    User::where('id', '<', 0) // No results
        ->chunk(100, function ($users) use (&$callCount) {
            $callCount++;
        });

    expect($callCount)->toBe(0);
});

Common Pitfalls

Modifying filter columns during chunk()

// WRONG: May skip records
User::where('processed', false)
    ->chunk(100, function ($users) {
        foreach ($users as $user) {
            $user->update(['processed' => true]); // Changes the WHERE condition!
        }
    });

// CORRECT: Use chunkById()
User::where('processed', false)
    ->chunkById(100, function ($users) {
        foreach ($users as $user) {
            $user->update(['processed' => true]);
        }
    });

Not handling chunk callback returns

// Return false to stop chunking
User::chunk(100, function ($users) {
    foreach ($users as $user) {
        if ($user->hasIssue()) {
            return false; // Stop processing
        }
        $user->process();
    }
});

Ignoring database connection limits

// Consider connection timeouts for long operations
DB::connection()->getPdo()->setAttribute(PDO::ATTR_TIMEOUT, 3600);

User::chunkById(100, function ($users) {
    // Long running process
});

Remember: When dealing with large datasets, always think about memory usage, query efficiency, and processing time. Chunk your data appropriately!

Adoption

noartem/laravel-data-chunking-large-datasets

$ install --global

Security Scan Results

SKILL.md

Data Chunking for Large Datasets

The Problem: Memory Exhaustion

Solution: Chunking Methods

1. Basic Chunking with chunk()

2. Chunk By ID for Safer Updates

3. Lazy Collections for Memory Efficiency

4. Cursor for Forward-Only Iteration

Real-World Examples

Export Large CSV

Batch Email Campaign

Data Migration/Transformation

Cleanup Old Records

Choosing the Right Method

Performance Optimization Tips

1. Select Only Needed Columns

2. Use Indexes

3. Disable Eloquent Events When Appropriate

4. Use Raw Queries for Bulk Updates

5. Queue Large Operations

Testing Chunked Operations

Common Pitfalls

Related Skills

noartem/xlsx

noartem/webapp-testing

noartem/web-design-guidelines

noartem/web-artifacts-builder

noartem/laravel-data-chunking-large-datasets

$ install --global

Security Scan Results

SKILL.md

Data Chunking for Large Datasets

The Problem: Memory Exhaustion

Solution: Chunking Methods

1. Basic Chunking with chunk()

2. Chunk By ID for Safer Updates

3. Lazy Collections for Memory Efficiency

4. Cursor for Forward-Only Iteration

Real-World Examples

Export Large CSV

Batch Email Campaign

Data Migration/Transformation

Cleanup Old Records

Choosing the Right Method

Performance Optimization Tips

1. Select Only Needed Columns

2. Use Indexes

3. Disable Eloquent Events When Appropriate

4. Use Raw Queries for Bulk Updates

5. Queue Large Operations

Testing Chunked Operations

Common Pitfalls

Related Skills

noartem/xlsx

noartem/webapp-testing

noartem/web-design-guidelines

noartem/web-artifacts-builder

1. Basic Chunking with `chunk()`

1. Basic Chunking with `chunk()`