@rodrigopedra
Created April 17, 2022 04:22
PHP import CSV
<?php

// generate.php: writes a 21-million-row CSV (header + rows) to STDOUT
$file = fopen('php://output', 'w');

fputcsv($file, ['ID', 'NAME', 'AGE']);

for ($index = 0; $index < 21_000_000; $index++) {
    fputcsv($file, [$index + 1, bin2hex(random_bytes(20)), random_int(18, 65)]);
}

fclose($file);
<?php

// import-buffered.php: inserts BUFFER_SIZE rows per prepared statement execution
define('BUFFER_SIZE', 500);

$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_EMULATE_PREPARES => false,
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('DROP TABLE IF EXISTS `import`');
$pdo->exec('CREATE TABLE `import` (
    `id` BIGINT UNSIGNED NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    `age` INT UNSIGNED NOT NULL
)');

// one INSERT with BUFFER_SIZE placeholder groups: (?, ?, ?),(?, ?, ?),...
$statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES ' . implode(',', array_fill(0, BUFFER_SIZE, '(?, ?, ?)')));

echo date('Y-m-d H:i:s'), PHP_EOL;

$file = fopen($argv[1], 'r');
fgetcsv($file, 100); // skip the header line

$buffer = [];

while (!feof($file)) {
    $fields = fgetcsv($file, 100);

    if ($fields === false) {
        continue;
    }

    $buffer[] = $fields;

    if (count($buffer) === BUFFER_SIZE) {
        $statement->execute(array_merge(...$buffer));
        $buffer = [];
    }
}

fclose($file);

// flush any leftover rows with a statement sized to the remainder
$remaining = count($buffer);

if ($remaining > 0) {
    $statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES ' . implode(',', array_fill(0, $remaining, '(?, ?, ?)')));
    $statement->execute(array_merge(...$buffer));
}

echo date('Y-m-d H:i:s'), PHP_EOL;

$value = $pdo->query('SELECT COUNT(*) FROM `import`')->fetchColumn();
echo number_format($value), PHP_EOL;
<?php

// import-linear.php: inserts one row per statement execution
$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_EMULATE_PREPARES => false,
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('DROP TABLE IF EXISTS `import`');
$pdo->exec('CREATE TABLE `import` (
    `id` BIGINT UNSIGNED NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    `age` INT UNSIGNED NOT NULL
)');

$statement = $pdo->prepare('INSERT INTO `import`(`id`, `name`, `age`) VALUES (:id, :name, :age)');

echo date('Y-m-d H:i:s'), PHP_EOL;

$file = fopen($argv[1], 'r');
fgetcsv($file, 100); // skip the header line

while (!feof($file)) {
    $fields = fgetcsv($file, 100);

    if ($fields === false) {
        continue;
    }

    $statement->execute([
        ':id' => $fields[0],
        ':name' => $fields[1],
        ':age' => $fields[2],
    ]);
}

fclose($file);

echo date('Y-m-d H:i:s'), PHP_EOL;

$value = $pdo->query('SELECT COUNT(*) FROM `import`')->fetchColumn();
echo number_format($value), PHP_EOL;
$ php generate.php > data.csv
$ ls -lh data.csv
-rw-r--r-- 1 rodrigo rodrigo 1,1G abr 17 01:02 data.csv
$ php import-linear.php data.csv
2022-04-17 01:02:50
2022-04-17 01:17:00
21,000,000
$ php import-buffered.php data.csv
2022-04-17 01:17:16
2022-04-17 01:19:12
21,000,000
@rodrigopedra (Author):

Quoting @jamesmills: "The other point is around the downtime aspect. I am indeed running this import script daily. I do exactly as you did in your sample and drop the table, then create it again before the fresh import. My concern is that I'll have live requests coming into the application which rely on this data as a lookup. So if I continue with this setup then there will be lookups happening on this table as it's dropped, created and re-populated. I'm wondering if I should do the creation into a my_table_name_temp and then once completed drop the my_table_name and rename my_table_name_temp to my_table_name. Again, would welcome your thoughts on this."

You can:

  • Take the app down while updating. Laravel, for example, has a maintenance mode specifically for this use case, which returns a 503 response to any request while your app is down for maintenance (see the commands just after this list).
  • Or use a strategy like the one you described: drop the old table, then rename the new one.
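
For reference, assuming a reasonably recent Laravel install, wrapping the daily refresh in maintenance mode from the command line could look like this (reusing the buffered script from this gist as the import step):

$ php artisan down                    # every request now gets a 503 maintenance page
$ php import-buffered.php data.csv    # refresh the data while nothing else reads it
$ php artisan up                      # back to serving traffic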

When I dealt with large databases, I usually used a mix of both: I ingested data into a staging table (actually a real table created with the same structure as the target table), and once the data was ingested and verified, I would drop the old table and rename the new one.
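
As a minimal sketch of that swap, assuming MySQL and borrowing the `import` table from the scripts above (with `import_new` and `import_old` as hypothetical staging/backup names), it could look like the snippet below. MySQL's RENAME TABLE can move several tables in a single atomic statement, so lookups never see a missing table:

<?php

// sketch only: ingest into a staging table, then swap it in atomically
$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// 1. build the staging table with the same structure as the live table
$pdo->exec('DROP TABLE IF EXISTS `import_new`');
$pdo->exec('CREATE TABLE `import_new` LIKE `import`');

// 2. ingest and verify the fresh data into `import_new` here
//    (for example with the buffered insert script above, pointed at `import_new`)

// 3. swap: one atomic statement retires the old table and promotes the new one
$pdo->exec('RENAME TABLE `import` TO `import_old`, `import_new` TO `import`');

// 4. once the new data checks out, discard the previous copy
$pdo->exec('DROP TABLE `import_old`');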

Note that if you have foreign keys referencing the table being replaced, you will need to (see the sketch right after this list):

  1. disable foreign key checks on the database
  2. drop the foreign keys
  3. drop the old table
  4. rename the new one
  5. recreate the foreign keys
  6. re-enable foreign key checks on the database.
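
A rough sketch of those steps, again assuming MySQL, with a hypothetical child table `orders` holding a constraint named `orders_import_id_foreign` against `import` (both names invented for illustration; the referenced column must be indexed, e.g. declared as a primary key, for the constraint to be recreated):

<?php

// sketch only: swapping a table that is referenced by a foreign key
$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('SET FOREIGN_KEY_CHECKS = 0'); // 1. disable foreign key checks (session level)

$pdo->exec('ALTER TABLE `orders` DROP FOREIGN KEY `orders_import_id_foreign`'); // 2. drop the constraint

$pdo->exec('DROP TABLE `import`');                   // 3. drop the old table
$pdo->exec('RENAME TABLE `import_new` TO `import`'); // 4. rename the new one

$pdo->exec('ALTER TABLE `orders` ADD CONSTRAINT `orders_import_id_foreign`
    FOREIGN KEY (`import_id`) REFERENCES `import` (`id`)'); // 5. recreate the constraint

$pdo->exec('SET FOREIGN_KEY_CHECKS = 1'); // 6. re-enable foreign key checks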

There are other strategies you can use if you can't afford any downtime:

  • You can create a replica database, update the replica, and then switch the app's connection to the replica.

Another one is using transactions, which will lock your app for a while, but both the database and the app can stay alive; requests will just be delayed. It is very similar to the drop/rename strategy above (a sketch follows below):

  1. Ingest new data into a temporary table
  2. disable foreign key checks on the database
  3. start a transaction
  4. delete all data from the target table (yes, I am talking about a DELETE without a WHERE)
  5. insert the data from the temporary table into the target table (use INSERT ... SELECT for this)
  6. commit the transaction
  7. re-enable foreign key checks on the database
  8. drop the temporary table we used to ingest the new data

The main difference here is the transaction, which helps guarantee the database stays in a good state once the data is committed.
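
A sketch of that flow, once more assuming MySQL, an InnoDB target table, and a hypothetical staging table named `import_new` that was already populated and verified (step 1 above):

<?php

// sketch only: replace the live data inside a single transaction
$pdo = new PDO('mysql:host=127.0.0.1;dbname=dummy;charset=utf8mb4', 'root', 'password', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec('SET FOREIGN_KEY_CHECKS = 0'); // step 2: disable foreign key checks (session level)

$pdo->beginTransaction();                 // step 3

$pdo->exec('DELETE FROM `import`');       // step 4: yes, no WHERE clause

$pdo->exec('INSERT INTO `import` (`id`, `name`, `age`)
    SELECT `id`, `name`, `age` FROM `import_new`'); // step 5: INSERT ... SELECT

$pdo->commit();                           // step 6

$pdo->exec('SET FOREIGN_KEY_CHECKS = 1'); // step 7: re-enable foreign key checks

$pdo->exec('DROP TABLE `import_new`');    // step 8: discard the staging table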

This approach has some downsides:

  • You need sufficient disk space: since you are dealing with a big dataset, you are essentially going to hold three copies of the data while the transaction is running.
    • Aside from the two tables, database servers copy the data altered inside a transaction to some sort of "undo" storage, in case you need to roll back the transaction instead of committing it.
  • Reads on the database are usually fine while the transaction is running, but writes that depend on the data inside the transaction will be queued up, which could lead to timeouts.

Hope these ideas help =)

@rodrigopedra (Author):

@jamesmills a final disclaimer:

Although I spent some years (about 7) dealing with large datasets, I worked more on the ETL and analysis side of the job, not on database management.

Of course, over the years you learn a trick or two from dealing with the same problem set over and over. But the options I gave above, especially in the second comment on your additional questions, might not be the best solution.

Those are strategies I would use, based on my experience, but if my team had a database management expert, I would talk to them first to assess whether this is the best strategy.

So if anyone else reading this knows better, please share in the comments how you would handle these cases. Thanks =)
