Talk: Java Performance Puzzlers by Douglas Hawkins


Building a Newsletter-to-PDF Utility With Puppeteer, Hummus, and Cheerio

Today, I had an idea to build a command-line utility that converts a technical newsletter like hackernewsletter to PDF. This will let me read newsletters offline while travelling. After a couple of hours, the first version of the utility was ready. In this post, I will share how I was able to quickly build a working version. Along the way, I learnt about a couple of cool Node modules that made the utility easy to build. I was amazed by the rapid prototyping capabilities of the Node ecosystem.

What do we want to build?

We want to build a newsletter-to-PDF utility that, given a newsletter issue URL, generates a single PDF containing the content of all the stories.

To build this utility, we need to perform the following tasks:

  1. Given a newsletter URL, find all the story URLs.
  2. For each story URL, generate a PDF.
  3. Combine individual PDFs into a single PDF.
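The three steps above chain naturally with async/await. Here is a minimal orchestration sketch; the step functions in it are placeholders standing in for the real implementations developed in the rest of this post:

```javascript
// Placeholder step implementations so the sketch runs end to end;
// the real versions are built step by step in the sections below.
async function extractLinksFromUrl(url) {
    return [
        { url: 'https://example.com/a', title: 'Story A' },
        { url: 'https://example.com/b', title: 'Story B' }
    ];
}
async function generatePdf(url, outputDir, filename) {
    // renders the page at `url` to `outputDir/filename`
}
function combinePdfs(files) {
    // merges the per-story PDFs into a single newsletter PDF
}

async function newsletterToPdf(url, outputDir) {
    const stories = await extractLinksFromUrl(url);             // Step 1
    const files = [];
    for (let i = 0; i < stories.length; i++) {
        const filename = `${i}.pdf`;
        await generatePdf(stories[i].url, outputDir, filename); // Step 2
        files.push(filename);
    }
    combinePdfs(files);                                         // Step 3
    return files;
}
```

The steps run sequentially here for simplicity; step 2 could be parallelised per story if generation becomes the bottleneck.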

Step 1: Find all the story URLs from a newsletter URL

Most newsletters allow users to view them in a browser. For example, http://mailchi.mp/hackernewsletter/373?e=2679b477c5 is the URL for issue #373 of hackernewsletter. Each newsletter issue has a set of stories that a reader can read. The first thing we have to do is find all the URLs that correspond to a story. To accomplish this task, I made use of the Cheerio and request libraries. I used the request-promise library so that I could use request with Promise support. The code shown below extracts all the anchor tags whose title attribute contains the text Votes.

const rp = require('request-promise');
const cheerio = require('cheerio');

class Story {
    constructor(url, title) {
        this.url = url;
        this.title = title;
    }
}

function extractLinksFromUrl(url) {
    const options = {
        uri: url,
        // Have request-promise hand us a loaded Cheerio object directly
        transform: function (body) {
            return cheerio.load(body);
        }
    };
    return rp(options)
        .then(function ($) {
            // Story anchors are the ones whose title attribute contains 'Votes'
            const links = $('a').filter(function () {
                const titleAttr = $(this).attr('title');
                return titleAttr && titleAttr.includes('Votes');
            });
            console.log('Found %d story links', links.length);
            return links.map(function () {
                return new Story($(this).attr('href'), $(this).text());
            }).get();
        })
        .catch(function (err) {
            console.log('Encountered error', err);
        });
}
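One refinement worth considering (my own addition, not part of the utility above): the scraped hrefs may be relative, so it is safer to resolve them against the newsletter URL before fetching. Node's built-in WHATWG URL class handles this without any extra dependency:

```javascript
// Resolve a possibly-relative href against the page it was scraped from.
// Absolute hrefs pass through unchanged.
function toAbsoluteUrl(href, baseUrl) {
    return new URL(href, baseUrl).href;
}
```

If the newsletter's hrefs turn out to be absolute already, this is simply a no-op.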

Step 2: Generate PDF for each story

The next step is to generate a PDF for each story. For this I made use of Google’s puppeteer module. Puppeteer is a Node API for headless Chrome that you can use to generate PDFs, take screenshots, scrape website content, and more.

const puppeteer = require('puppeteer');
const path = require('path');

async function generatePdf(url, outputDir, filename) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Intercept requests so we can skip anything we don't need
    await page.setRequestInterceptionEnabled(true);
    page.on('request', request => {
        if (request.url.includes('disqus'))
            request.abort();
        else
            request.continue();
    });
    await page.goto(url,
        {
            waitUntil: 'networkidle',
            networkIdleTimeout: 5000,
            timeout: 3000000
        }
    );
    await page.pdf({
        path: path.join(outputDir, filename),
        format: 'A4'
    });

    await browser.close();
}

In the code shown above, we use the standard Puppeteer API to generate a PDF for a URL. One thing you might notice is the use of request interception. Puppeteer allows you to intercept each request and decide whether it should go through. This gives you the flexibility to block ads or comment-hosting sites like Disqus. I aborted all requests to Disqus as I only care about the article content.
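Each story also needs a filename. A small helper I would add here (the naming convention is my own assumption, not from the utility above) derives a filesystem-safe name from the story title, with a zero-padded index prefix that records newsletter order:

```javascript
// Turn a story title into a safe PDF filename; the zero-padded index
// prefix preserves the order the stories appeared in the newsletter.
function storyFilename(index, title) {
    const safe = title
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, '-')  // collapse unsafe characters to '-'
        .replace(/^-+|-+$/g, '')      // trim leading/trailing dashes
        .slice(0, 60);                // keep the path reasonably short
    return `${String(index).padStart(3, '0')}-${safe}.pdf`;
}
```

For example, `storyFilename(1, 'Hello, World!')` yields `001-hello-world.pdf`, which can be passed straight to `generatePdf` as the `filename` argument.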

Step 3: Combine the individual stories into one PDF

It took me some time to figure out how to do this. After step 2, I was able to generate a PDF for each story URL. I wanted to produce a single PDF for the entire newsletter, which means merging all the PDFs into one. After a bit of googling, I found a library named hummus. Hummus is a Node module for creating, parsing, and modifying PDFs. I relied on its ability to append one PDF to another. In the code shown below, we create a new PDF file and append the pages of each existing PDF to it. This makes it possible to create a single PDF with the content of all the stories.

const hummus = require('hummus');

function combinePdfs(files) {
    const pdfWriter = hummus.createWriter('newsletter.pdf');
    files
        .filter(file => file.endsWith('.pdf'))
        .forEach(file => {
            // Append every page of this PDF to the combined document
            pdfWriter.appendPDFPagesFromPDF(file);
        });

    pdfWriter.end();
}
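Since appendPDFPagesFromPDF concatenates in iteration order, the file list should be sorted before merging; fs.readdirSync makes no ordering guarantee. Assuming the per-story files carry a numeric prefix (e.g. 001-story.pdf, a convention of my own rather than something from the utility above), the sort could look like:

```javascript
// Keep only PDFs and sort them by their numeric prefix so the merged
// document preserves newsletter order. Assumes names like '001-foo.pdf'.
function sortPdfsByPrefix(files) {
    return files
        .filter(f => /^\d+.*\.pdf$/.test(f))
        .sort((a, b) => parseInt(a, 10) - parseInt(b, 10));
}
```

The output of this function can be passed directly to combinePdfs.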

I will publish the node module in the next few days after polishing it a bit. Let me know what features you would like to see in this utility.

 

Amazon ECS: The Modern Cluster Manager Part 1

In the last few posts, we looked at various Docker utilities and how XL Deploy can make it easy for enterprises to adopt and use Docker. Docker streamlines software development and testing for teams that have embraced it. Docker’s package-once-deploy-anywhere (PODA) capability minimises issues caused by differences between environments like staging, quality assurance, and production. Continue reading “Amazon ECS: The Modern Cluster Manager Part 1”

Ayrton Senna Life Documentary

Yesterday, I watched a documentary on Ayrton Senna’s life. Ayrton Senna was a Brazilian racing driver who died in an accident while leading the 1994 San Marino Grand Prix at the Autodromo Enzo e Dino Ferrari in Italy. He was a three-time Formula One World Champion. He was so humble and down to earth even after achieving so much success in life. He fought politics and the system to become the world’s fastest driver. A beautiful documentary on the life of one of the greatest drivers.

Why I dislike meetings

One aspect of work that I don’t like is meetings. The reason I have a hard time in meetings is that they force you to come up with a magical solution to a problem you don’t completely understand. You are forced to put forward incomplete thoughts just to justify your presence in the meeting. This then turns into a mess where everyone tries to talk over each other. It becomes a battle of who can come up with the sentence containing all the right keywords that decision makers want to hear. Rather than listening to others and reasoning about each other’s thoughts, our minds are busy thinking about what to say next to score brownie points.

I don’t have a solution to this problem. I try to stay quiet in such meetings, which makes me ineligible for the follow-up meeting 🙂

Understanding Akka Dispatchers

This week I had to work on tuning an execution engine built using Akka. Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven systems. This post assumes you already know Akka. An actor needs a dispatcher to perform its task, and a dispatcher relies on an executor to provide threads. There are two types of executors a dispatcher can have: 1) fork-join-executor, and 2) thread-pool-executor. In this post, we will understand how you can configure fork-join-executor and thread-pool-executor to meet your needs. Continue reading “Understanding Akka Dispatchers”

Actor System Termination on JVM Shutdown

In my day job, I work on a backend system that uses Akka. Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven systems. I will write an in-depth Akka tutorial some other week. This week I will talk about a specific problem I was trying to solve. We have two applications that talk to each other via Akka remoting. The first application can shut down the second programmatically by sending a message to the second application’s ActorSystem. Shutdown here means exiting the JVM. To make sure we do a clean shutdown, we added a JVM shutdown hook that terminates the ActorSystem. Continue reading “Actor System Termination on JVM Shutdown”