2013-03-19
Today I wrote a web service that concatenates PDF files.
Optionally, the text content is converted to image content to
prevent straightforward copy'n'paste of the text. (I don't want
to discuss the sense of this detextification; I did not write
this program for myself.)
The core consists of shell scripts that use gs(1), convert(1)
(from ImageMagick) and tiff2pdf(1) (from libtiff-tools). A PHP
script serves as the web interface.
Pdfconcat is straightforward:
#!/bin/sh
#
# concatenate the given PDF files to stdout
gs -q -dNOPAUSE -dBATCH -sPAPERSIZE=a4 -dPDFSETTINGS=/prepress \
	-sDEVICE=pdfwrite -sOutputFile=- "$@"
Pdfdetextify needed much more research. It is important to
compress the files to avoid wasting hundreds of megabytes.
(Sadly, tiff2pdf(1) cannot read from stdin, hence the temporary
file.)
#!/bin/sh
#
# convert pdf to tiff and back to pdf in order to convert text to image
# writes to stdout
#
# depends on: gs, libtiff-tools (tiff2pdf)
temp="`mktemp /tmp/${0##*/}.XXXXXX`"
trap 'rm -f "$temp"' 0 1 2 3 15
for i do
	# echo "processing $i"
	gs -q -dNOPAUSE -dBATCH -sPAPERSIZE=a4 -dPDFSETTINGS=/prepress \
		-r300 -o "$temp" -sDEVICE=tiffgray \
		-sCompression=lzw "$i"
	tiff2pdf -z "$temp"
done
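The temporary file handling can be tried in isolation. The following is only a sketch, assuming a POSIX shell and mktemp(1); the file name template and the sample data are made up. A child shell plays the role of the script, and the parent checks afterwards that the trap has removed the file:

```shell
#!/bin/sh
# Sketch: a child shell creates a temp file, uses it, and leaves the
# cleanup to a trap on exit (signal 0), as pdfdetextify does.
name="$(sh <<'EOF'
temp="$(mktemp /tmp/demo.XXXXXX)" || exit 1
trap 'rm -f "$temp"' 0 1 2 3 15
echo "page data" >"$temp"   # stands in for the TIFF that gs would write
echo "$temp"                # tell the parent which file was used
EOF
)"

# The child has exited by now, so its trap should have run.
if [ -e "$name" ]; then
	echo "still there: $name"
else
	echo "cleaned up: $name"
fi
```

The trap also lists the signals 1, 2, 3 and 15, so the file should be removed as well when the script is interrupted or terminated.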
Pdfconcat.php is the web interface script that glues the stuff
together. It is more of a hack:
PDF concat and detextify
<?php
[...]
function
detextify($file)
{
	[...]
	$cmd = sprintf("%s %s 2>&1 >%s", PDFDETEXTIFY, $file, $newfile);
	system($cmd);
	return $newfile;
}
function
concatpdfs($files)
{
	$newfile = sprintf("%s/%s/%s.pdf", dirname(__FILE__), UPLOADDIR,
	    date('Y-m-d_H-i-s'));
	$cmd = sprintf("%s %s 2>&1 >%s", PDFCONCAT, implode(' ', $files),
	    $newfile);
	system($cmd);
	foreach ($files as $file) {
		unlink($file);
	}
	return sprintf("%s/%s", UPLOADDIR, basename($newfile));
}
function
procfiles()
{
	$files = array();
	foreach ($_FILES as $key => $val) {
		if ($val['error'] == UPLOAD_ERR_NO_FILE) {
			continue;
		}
		if ($val['error'] > 0) {
			echo "Error: $val[error]";
			echo "Skipping $val[name] ...\n";
			continue;
		}
		if (isset($_POST[$key.'detextify']) &&
		    $_POST[$key.'detextify'] == 'on') {
			$files[] = detextify($val['tmp_name']);
		} else {
			$files[] = $val['tmp_name'];
		}
	}
	return concatpdfs($files);
}
// main()
if (isset($_POST['submit'])) {
$outfile = procfiles();
echo '
';
echo '';
echo '
';
}
?>
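The order of the redirections in the command strings, `2>&1 >file`, matters. Stderr is first pointed at the current stdout, and only then is stdout moved to the file. With system(), which passes the command's stdout through to the page, this presumably lets error messages show up in the browser while the PDF data goes to the file. A small sketch of the effect in plain shell, where noisy is a made-up stand-in for the real scripts:

```shell
#!/bin/sh
# Sketch of the redirection order "2>&1 >file".
# noisy is a hypothetical stand-in for pdfconcat or pdfdetextify:
# one line goes to stdout, one to stderr.
noisy() {
	echo "pdf data"
	echo "error message" >&2
}

out="$(mktemp /tmp/redir.XXXXXX)" || exit 1

# 2>&1 points stderr at the current stdout (here the command
# substitution, in the PHP script the web page); >"$out" then
# redirects only stdout to the file.
shown="$(noisy 2>&1 >"$out")"
saved="$(cat "$out")"
rm -f "$out"

echo "shown to the user: $shown"
echo "saved in the file: $saved"
```

Written the other way round, `>file 2>&1`, both streams would end up in the file and errors would corrupt the PDF.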
The final component is a cron job that removes the generated
files after some time:
#!/bin/sh
#
# print list of old files from the upload directory
# output is meant to be piped into: `xargs rm'
if [ $# -lt 2 ] ; then
	echo "usage: ${0##*/} NUM_OF_DAYS_TO_KEEP DIR..." >&2
	exit 1
fi
days="$1"
shift
find "$@" -maxdepth 1 -mindepth 1 -atime +"$days" -print
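What the find call selects can be checked on a scratch directory. This is a sketch under assumptions: the directory and file names are invented, seven days is an arbitrary limit, and `mktemp -d` and `-maxdepth` need a GNU or BSD userland:

```shell
#!/bin/sh
# Sketch: list files by access time, then remove them via xargs.
dir="$(mktemp -d /tmp/oldfiles.XXXXXX)" || exit 1

touch "$dir/new.pdf"                      # accessed just now
touch "$dir/old.pdf"
touch -a -t 200001010000 "$dir/old.pdf"   # pretend: last accessed in 2000

# same expression as in the script, keeping 7 days
old="$(find "$dir" -maxdepth 1 -mindepth 1 -atime +7 -print)"
echo "$old"

# removal, the way the cron job is meant to do it
echo "$old" | xargs rm
left="$(ls "$dir")"
echo "$left"

rm -rf "$dir"
```

Note that a plain `xargs rm` splits its input on whitespace, so this only works as long as the file names contain none. The names generated by the PHP script consist of date and time only, so that holds there.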
http://marmaro.de/lue/ markus schnalke