2013-03-19

Today, I've written a web service to concatenate PDF files.
Optionally, the text content is converted to image content to
prevent straightforward copy'n'paste of the text. (I don't want to
discuss the sense of this detextification; I have not written this
program for myself.)

The core consists of shell scripts that use gs(1), convert(1) (in
ImageMagick) and tiff2pdf(1) (in libtiff-tools). A PHP script
serves as the web interface.

Pdfconcat is straightforward:

    #!/bin/sh
    #
    # concatenate the given PDF files to stdout

    gs -q -dNOPAUSE -dBATCH -sPAPERSIZE=a4 -dPDFSETTINGS=/prepress \
        -sDEVICE=pdfwrite -sOutputFile=- "$@"

Pdfdetextify needed much more research. It is important to compress
the intermediate images to avoid wasting hundreds of megabytes.
(Sadly, tiff2pdf(1) cannot read from stdin, hence the temporary
file.)

    #!/bin/sh
    #
    # convert pdf to tiff and back to pdf in order to convert text to image
    # writes to stdout
    #
    # depends on: gs, libtiff-tools (tiff2pdf)

    temp="`mktemp /tmp/${0##*/}.XXXXXX`"
    trap 'rm -f "$temp"' 0 1 2 3 15

    for i do
        # echo "processing $i"
        gs -q -dNOPAUSE -dBATCH -sPAPERSIZE=a4 -dPDFSETTINGS=/prepress \
            -r300 -o "$temp" -sDEVICE=tiffgray \
            -sCompression=lzw "$i"
        tiff2pdf -z "$temp"
    done

Pdfconcat.php is the web interface script that glues the stuff together.
It is more of a hack:

    [...] PDF concat and detextify [...]
    [...] &1 >%s", PDFDETEXTIFY, $file, $newfile);
        system($cmd);
        return $newfile;
    }

    function concatpdfs($files) {
        $newfile = sprintf("%s/%s/%s.pdf", dirname(__FILE__), UPLOADDIR,
                date('Y-m-d_H-i-s'));
        $cmd = sprintf("%s %s 2>&1 >%s", PDFCONCAT,
                implode(' ', $files), $newfile);
        system($cmd);
        foreach ($files as $file) {
            unlink($file);
        }
        return sprintf("%s/%s", UPLOADDIR, basename($newfile));
    }

    function procfiles() {
        $files = array();
        foreach ($_FILES as $key => $val) {
            if ($val['error'] == UPLOAD_ERR_NO_FILE) {
                continue;
            }
            if ($val['error'] > 0) {
                echo "Error: $val[error]";
                echo "Skipping $val[name] ...\n";
                continue;
            }
            if (isset($_POST[$key.'detextify']) &&
                    $_POST[$key.'detextify'] == 'on') {
                $files[] = detextify($val['tmp_name']);
            } else {
                $files[] = $val['tmp_name'];
            }
        }
        return concatpdfs($files);
    }

    // main()
    if (isset($_POST['submit'])) {
        $outfile = procfiles();
        echo '[...]';
        echo '[...]The generated PDF[...]';
        echo '[...]';
    }
    ?>

    (Only PDF files smaller than [...])

    [...] detextify?
    [...] detextify?
    [...] detextify?
    [...] detextify?
    [...] detextify?

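A note on the command strings that pdfconcat.php builds: they use
the redirection order `2>&1 >file'. The following sketch shows what
that order does in sh. The function names emit and split_streams are
made up for the illustration; emit merely stands in for pdfconcat or
pdfdetextify.

```shell
#!/bin/sh
# Hypothetical stand-in for pdfconcat: payload on stdout,
# a diagnostic on stderr.
emit() {
    echo "PDF data"
    echo "warning: something went wrong" >&2
}

# Run emit with the same redirections as in the PHP command string.
# Redirections are processed left to right: first stderr is pointed
# at the current stdout, then stdout alone is sent to the file "$1".
split_streams() {
    emit 2>&1 >"$1"
}

out=$(mktemp) || exit 1
seen=$(split_streams "$out")

echo "in the file: $(cat "$out")"
echo "on stdout:   $seen"
rm -f "$out"
```

Since PHP's system() passes the command's standard output through to
the page, this order presumably makes the diagnostics of the scripts
show up in the browser, while the PDF data goes to the file.
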
The final component is a cron job that removes the generated files
after some time:

    #!/bin/sh
    #
    # print list of old files from the upload directory
    # output is meant to be piped into: `xargs rm'

    if [ $# -lt 2 ] ; then
        echo "usage: ${0##*/} NUM_OF_DAYS_TO_KEEP DIR..." >&2
        exit 1
    fi
    days="$1"
    shift

    find "$@" -maxdepth 1 -mindepth 1 -atime +"$days" -print


http://marmaro.de/lue/
markus schnalke
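
The listing of old files can be tried out without waiting for days.
A minimal sketch, assuming a touch(1) that supports the POSIX -t
option and a find(1) that knows -maxdepth and -mindepth; the function
name list_old and the file names are invented for the demonstration.

```shell
#!/bin/sh
# Same find invocation as in the cron script, wrapped in a function.
list_old() {
    days="$1"
    shift
    find "$@" -maxdepth 1 -mindepth 1 -atime +"$days" -print
}

dir=$(mktemp -d) || exit 1
touch "$dir/new.pdf" "$dir/old.pdf"
# Backdate only the access time of old.pdf to the year 2000.
touch -a -t 200001010000 "$dir/old.pdf"

# Remove everything not accessed within the last 7 days.
# (File names containing whitespace would confuse plain xargs.)
list_old 7 "$dir" | xargs rm

# Show what is left, then clean up the demo directory.
ls "$dir"
rm -rf "$dir"
```

In a crontab, the real script would be called the same way, with the
number of days and the upload directory as arguments.
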