Building a Newsletter-to-PDF Utility With Puppeteer, Hummus, and Cheerio

Today, I had an idea to build a command-line utility to convert a technical newsletter like hackernewsletter to PDF. This will enable me to read newsletters offline while travelling. After spending a couple of hours, the first version of the utility is ready. In this post, I will share how I was able to quickly build a working version of the utility. Along the way, I learnt about a couple of cool Node modules that made the job easy. I was amazed by the rapid prototyping capabilities of the Node ecosystem.

What do we want to build?

We want to build a newsletter-to-PDF utility that, given a newsletter issue URL, generates a single PDF with the content of all the stories.

To build this utility, we need to perform the following tasks (a minimal sketch tying them together appears after the list):

  1. Given a newsletter URL, find all the story URLs.
  2. For each story URL, generate a PDF.
  3. Combine individual PDFs into a single PDF.
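
Here is a minimal sketch of how the three steps could be wired together. The functions extractLinksFromUrl, generatePdf, and combinePdfs are the ones we will build in the steps below; the out directory and the zero-padded file names are assumptions for illustration.

const fs = require('fs');
const path = require('path');

async function newsletterToPdf(newsletterUrl) {
    const outputDir = 'out'; // assumed output directory
    if (!fs.existsSync(outputDir)) fs.mkdirSync(outputDir);

    // Step 1: collect all story URLs from the newsletter page
    const stories = await extractLinksFromUrl(newsletterUrl);

    // Step 2: render each story to its own PDF
    // (zero-padded names so the files sort in story order)
    for (let i = 0; i < stories.length; i++) {
        await generatePdf(stories[i].url, outputDir, `${String(i).padStart(3, '0')}.pdf`);
    }

    // Step 3: merge the per-story PDFs into a single file
    const files = fs.readdirSync(outputDir)
        .map(f => path.join(outputDir, f));
    combinePdfs(files);
}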

Step 1: Find all the story URLs from a newsletter URL

Most newsletters allow users to view them in a browser. For example, http://mailchi.mp/hackernewsletter/373?e=2679b477c5 is the URL for issue #373 of hackernewsletter. Each newsletter has a set of stories that a reader can read. The first thing we have to do is find all the URLs that correspond to a story. To accomplish this task, I made use of the Cheerio and request libraries. I used the request-promise library so that I could use request with Promise API support. The code shown below extracts all the anchor tags whose title attribute contains the text Votes.

const rp = require('request-promise');
const cheerio = require('cheerio');

function extractLinksFromUrl(url) {
    const options = {
        uri: url,
        // Load the response body straight into cheerio
        transform: body => cheerio.load(body)
    };
    return rp(options)
        .then(function ($) {
            // Story links on the newsletter page carry a title like "498 Votes"
            const links = $('a').filter(function () {
                const titleAttr = $(this).attr('title');
                return titleAttr && titleAttr.includes('Votes');
            });
            return links.map(function () {
                return new Story($(this).attr('href'), $(this).text());
            }).get();
        })
        .catch(function (err) {
            console.log('Encountered error', err);
        });
}

class Story {
    constructor(url, title) {
        this.url = url;
        this.title = title;
    }
}
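
A quick way to try it out is to call the function with the issue URL from above and print what comes back (a sketch; the actual links depend on the issue):

extractLinksFromUrl('http://mailchi.mp/hackernewsletter/373?e=2679b477c5')
    .then(stories => {
        stories.forEach(story => console.log(story.title, '->', story.url));
    });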

Step 2: Generate a PDF for each story

The next step is to generate a PDF for each story. For this, I made use of Google’s puppeteer module. Puppeteer is a headless Chrome Node API that you can use to generate PDFs, take screenshots, scrape website content, and more.

const puppeteer = require('puppeteer');
const path = require('path');

async function generatePdf(url, outputDir, filename) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Intercept requests so we can skip resources we do not need
    await page.setRequestInterception(true);
    page.on('request', request => {
        if (request.url().includes('disqus'))
            request.abort();
        else
            request.continue();
    });

    // Wait until the network is mostly idle so lazy-loaded content is rendered
    await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 3000000
    });

    await page.pdf({
        path: path.join(outputDir, filename),
        format: 'A4'
    });

    await browser.close();
}

In the code shown above, we use the standard Puppeteer API to generate a PDF for a URL. One thing you will notice is the use of request interception. Puppeteer allows you to intercept each request and decide whether it should go through. This gives you the flexibility to block ads or comment widgets like Disqus. I aborted all requests to Disqus as I only care about the article content.
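
For example, the same interception hook can block ad networks as well; the domain list below is only an illustrative assumption:

page.on('request', request => {
    // Illustrative block list; extend it with whatever you want to skip
    const blocked = ['disqus', 'doubleclick', 'googlesyndication'];
    if (blocked.some(domain => request.url().includes(domain)))
        request.abort();
    else
        request.continue();
});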

Step 3: Combine the individual stories into one PDF

It took me some time to figure out this step. After step 2, I was able to generate a PDF for each story URL, but I wanted a single PDF for the entire newsletter. This means I had to merge all the PDFs into one. After a bit of googling, I found a library named hummus. Hummus is a Node module for creating, parsing, and modifying PDFs. I relied on its ability to append one PDF to another. In the code shown below, we create a new PDF file and append the pages of each existing PDF to it, which makes it possible to create a single PDF with all the stories' content.

const hummus = require('hummus');

function combinePdfs(files) {
    // Create the target PDF and append every input PDF's pages to it
    const pdfWriter = hummus.createWriter('newsletter.pdf');
    files
        .filter(file => file.endsWith('.pdf'))
        .forEach(file => {
            pdfWriter.appendPDFPagesFromPDF(file);
        });

    pdfWriter.end();
}
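
To wire this up with step 2, we can read the PDFs from the output directory and pass their full paths to combinePdfs (a sketch, assuming the out directory used earlier):

const fs = require('fs');
const path = require('path');

const outputDir = 'out';
const files = fs.readdirSync(outputDir)
    .sort() // zero-padded names keep the stories in order
    .map(f => path.join(outputDir, f));
combinePdfs(files);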

I will publish the Node module in the next few days after polishing it a bit. Let me know what features you would like to have in this utility.

Amazon ECS: The Modern Cluster Manager Part 1

In the last few posts, we looked at various Docker utilities and how XL Deploy can make it easy for enterprises to adopt and use Docker. Docker streamlines software development and testing for teams that have started embracing it. The package once, deploy anywhere (PODA) capability of Docker minimises the issue of differences between environments like staging, quality assurance, and production. Continue reading “Amazon ECS: The Modern Cluster Manager Part 1”

Ayrton Senna Life Documentary

Yesterday, I watched a documentary on Ayrton Senna's life. Ayrton Senna was a Brazilian racing driver who died in an accident while leading the 1994 San Marino Grand Prix at the Autodromo Enzo e Dino Ferrari in Italy. He was a three-time Formula One World Champion. He was so humble and down to earth even after achieving so much success in life. He fought politics and the system to become the world's fastest driver. It is a beautiful documentary on the life of one of the greatest drivers.