Building a Newsletter-to-PDF Utility With Puppeteer, Hummus, and Cheerio

Today, I had an idea to build a command-line utility that converts technical newsletters like hackernewsletter to PDF. This will enable me to read newsletters offline while travelling. After spending a couple of hours, the first version of the utility is ready. In this post, I will share how I was able to quickly build a working version of the utility. While building it, I learnt about a couple of cool Node modules that made the job easy. I was amazed by the rapid prototyping capabilities of the Node ecosystem.

What do we want to build?

We want to build a newsletter-to-PDF utility that, given a newsletter issue URL, generates a single PDF with the content of all the stories.

To build this utility, we need to perform the following tasks (a sketch wiring them together follows the list):

  1. Given a newsletter URL, find all the story URLs.
  2. For each story URL, generate a PDF.
  3. Combine individual PDFs into a single PDF.
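Before diving into each step, here is a minimal sketch of how the three steps fit together. The newsletterToPdf function name is my own invention; the three helpers it calls are implemented in the steps below.

const path = require('path');

// Hypothetical top-level flow; extractLinksFromUrl, generatePdf, and
// combinePdfs are implemented in steps 1-3 below.
async function newsletterToPdf(newsletterUrl, outputDir) {
    const stories = await extractLinksFromUrl(newsletterUrl); // Step 1
    const files = [];
    for (let i = 0; i < stories.length; i++) {
        const filename = `story-${i}.pdf`;
        await generatePdf(stories[i].url, outputDir, filename); // Step 2
        files.push(path.join(outputDir, filename));
    }
    combinePdfs(files); // Step 3
}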

Step 1: Find all the story URLs from a newsletter URL

Most newsletters allow users to view an issue in a browser. For example, http://mailchi.mp/hackernewsletter/373?e=2679b477c5 is the URL for issue #373 of hackernewsletter. Each newsletter issue has a set of stories that a reader can read. The first thing we have to do is find all the URLs that correspond to a story. To accomplish this task, I made use of the Cheerio and request libraries. I used the request-promise library so that I could use request with Promise API support. The code shown below extracts all the anchor tags whose title attribute contains the text Votes.

const rp = require('request-promise');
const cheerio = require('cheerio');

class Story {
    constructor(url, title) {
        this.url = url;
        this.title = title;
    }
}

function extractLinksFromUrl(url) {
    const options = {
        uri: url,
        // Load the response body into Cheerio so we get a jQuery-like API
        transform: function (body) {
            return cheerio.load(body);
        }
    };
    return rp(options)
        .then(function ($) {
            // Story links in hackernewsletter issues carry a title
            // attribute containing the vote count, e.g. "Votes: 250"
            const links = $('a').filter(function (i, el) {
                const titleAttr = $(el).attr('title');
                return titleAttr && titleAttr.includes('Votes');
            });
            console.log('links', links.length);
            return links.map(function (i, el) {
                return new Story($(el).attr('href'), $(el).text());
            }).get();
        })
        .catch(function (err) {
            console.log('Encountered error ', err);
        });
}
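For example, we can do a quick sanity check by pointing the function at the issue linked above and printing what it extracts:

extractLinksFromUrl('http://mailchi.mp/hackernewsletter/373?e=2679b477c5')
    .then(function (stories) {
        stories.forEach(story => console.log(story.title, '->', story.url));
    });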

Step 2: Generate a PDF for each story

The next step is to generate a PDF for each story. For this, I made use of Google's puppeteer module. Puppeteer is a headless Chrome Node API that you can use to generate PDFs, take screenshots, scrape the content of websites, etc.

const puppeteer = require('puppeteer');
const path = require('path');

async function generatePdf(url, outputDir, filename) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Intercept every request so we can decide whether to allow it
    await page.setRequestInterceptionEnabled(true);
    page.on('request', request => {
        if (request.url.includes('disqus'))
            request.abort();
        else
            request.continue();
    });
    // Wait until the network has been idle for 5 seconds before rendering
    await page.goto(url,
        {
            waitUntil: 'networkidle',
            networkIdleTimeout: 5000,
            timeout: 3000000
        }
    );
    await page.pdf({
        path: path.join(outputDir, filename),
        format: 'A4'
    });

    await browser.close();
}

In the code shown above, we use the standard Puppeteer API to generate a PDF for a URL. One thing you will notice is the use of request interception. Puppeteer allows you to intercept each request and decide whether you want to make it or not. This gives you the flexibility to block ads or comment widgets like Disqus. I aborted all the requests to Disqus as I only care about the article content.
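The same interception hook generalizes to any blocklist. Here is a sketch using the Puppeteer API from the code above with a hypothetical list of domains; only the disqus entry is something the utility actually needs, the other two are illustrative:

// Hypothetical blocklist; only 'disqus' is required by the utility itself
const blockedDomains = ['disqus', 'doubleclick', 'googlesyndication'];

page.on('request', request => {
    if (blockedDomains.some(domain => request.url.includes(domain)))
        request.abort();
    else
        request.continue();
});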

Step 3: Combine the individual stories into one PDF

It took me some time to figure out how to do this. After step 2, I was able to generate a PDF for each story URL. I wanted to generate a single PDF for the entire newsletter content. This means I had to merge all the PDFs into one. After a bit of googling, I found a library named hummus. Hummus is a Node module for creating, parsing, and modifying PDFs. I relied on its ability to append one PDF to a target PDF. In the code shown below, we create a new PDF file and append the pages of each existing PDF to it. This makes it possible to create a single PDF with the content of all the PDFs.

const hummus = require('hummus');

function combinePdfs(files) {
    // Create the target PDF that will hold all the stories
    const pdfWriter = hummus.createWriter('newsletter.pdf');
    files
        .filter(file => file.endsWith('.pdf'))
        .forEach(file => {
            // Append every page of the story PDF to the target PDF
            pdfWriter.appendPDFPagesFromPDF(file);
        });

    pdfWriter.end();
}
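To drive combinePdfs, we need the list of generated files. Here is a minimal sketch using Node's built-in fs module, assuming all the story PDFs from step 2 were written to a single output directory (the directory name out is a placeholder):

const fs = require('fs');
const path = require('path');

// Collect the per-story PDFs generated in step 2 and merge them
const outputDir = 'out'; // hypothetical output directory
const files = fs.readdirSync(outputDir)
    .map(name => path.join(outputDir, name));
combinePdfs(files);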

I will publish the node module in the next few days after polishing it a bit. Let me know what features you would like to see in this utility.
