Building a Newsletter-to-PDF Utility with Puppeteer, Hummus, and Cheerio

Today, I had an idea to build a command-line utility that converts technical newsletters such as hackernewsletter to PDF, so that I can read them offline while travelling. After a couple of hours of work, the first version of the utility is ready. In this post, I will share how I was able to build a working version so quickly. Along the way, I learnt about a couple of cool node modules that made the job easy, and I was amazed by the rapid prototyping capabilities of the node ecosystem.

What do we want to build?

We want to build a newsletter-to-pdf utility that, given a newsletter issue URL, generates a single PDF containing the content of all the stories.

To build this utility, we need to perform the following tasks:

  1. Given a newsletter URL, find all the story URLs.
  2. For each story URL, generate a PDF.
  3. Combine individual PDFs into a single PDF.

Step 1: Find all the story URLs from a newsletter URL

Most newsletters let users view an issue in a browser. For example, http://mailchi.mp/hackernewsletter/373?e=2679b477c5 is the URL for issue #373 of hackernewsletter. Each issue has a set of stories, so the first thing we have to do is find all the URLs that correspond to a story. To accomplish this, I used the Cheerio and request libraries, with request-promise so that request exposes a Promise API. The code shown below extracts all the anchor tags whose title attribute contains the text Votes.

const rp = require('request-promise');
const cheerio = require('cheerio');

// A story is one link in the newsletter: its target URL and its link text.
class Story {
    constructor(url, title) {
        this.url = url;
        this.title = title;
    }
}

function extractLinksFromUrl(url) {
    const options = {
        uri: url,
        // Parse the response body with cheerio so it can be queried like jQuery.
        transform: function (body) {
            return cheerio.load(body);
        }
    };
    return rp(options)
        .then(function ($) {
            // Story links are the anchors whose title attribute mentions "Votes".
            const links = $('a').filter(function (i, el) {
                const titleAttr = $(this).attr('title');
                return titleAttr && titleAttr.includes('Votes');
            });
            console.log('links', links.length);
            return $(links).map(function (i, link) {
                return new Story($(this).attr('href'), $(this).text());
            }).get();
        }).catch(function (err) {
            console.log('Encountered error ', err);
        });
}

Step 2: Generate a PDF for each story

The next step is to generate a PDF for each story. For this I used Google’s puppeteer module. Puppeteer is a Node API for headless Chrome that you can use to generate PDFs, take screenshots, scrape website content, and so on.

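const path = require('path');
const puppeteer = require('puppeteer');

// These calls match the Puppeteer release available when this was written.
// Later releases renamed setRequestInterceptionEnabled to setRequestInterception,
// made request.url a method, and replaced 'networkidle' with 'networkidle0'/'networkidle2'.
async function generatePdf(url, outputDir, filename) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Intercept every request so unwanted ones can be dropped.
    await page.setRequestInterceptionEnabled(true);
    page.on('request', request => {
        if (request.url.includes('disqus')) {
            request.abort();
        } else {
            request.continue();
        }
    });
    // Wait until the network has been idle before printing the page.
    await page.goto(url,
        {
            waitUntil: 'networkidle',
            networkIdleTimeout: 5000,
            timeout: 3000000
        }
    );
    await page.pdf({
        path: path.join(outputDir, filename),
        format: 'A4'
    });

    await browser.close();
}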

The code above uses the standard Puppeteer API to generate a PDF for a URL. One thing you will notice is the use of request interception: Puppeteer lets you intercept each request and decide whether it should go ahead. This gives you the flexibility to block ads or comment-hosting services like disqus. I aborted all requests to disqus because I only care about the article content.
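
The blocking decision can also be pulled out into a small function so that more hosts can be added later. The following is a minimal sketch, not part of the original utility: shouldBlock and BLOCKED_KEYWORDS are hypothetical names, and the second keyword is only illustrative.

```javascript
// Hypothetical list of substrings identifying requests we do not want to load.
// Only 'disqus' comes from the utility above; 'doubleclick' is illustrative.
const BLOCKED_KEYWORDS = ['disqus', 'doubleclick'];

// Returns true when the request URL contains any blocked keyword.
function shouldBlock(requestUrl, keywords = BLOCKED_KEYWORDS) {
    return keywords.some(keyword => requestUrl.includes(keyword));
}

console.log(shouldBlock('https://example.disqus.com/embed.js')); // true
console.log(shouldBlock('https://example.com/article.html'));    // false
```

Inside the request handler, the condition request.url.includes('disqus') could then become shouldBlock(request.url).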

Step 3: Combine the individual stories into one PDF

It took me some time to figure this out. After step 2, I had a PDF for each story URL, but I wanted a single PDF for the entire newsletter, which meant merging all the PDFs into one. After a bit of googling, I found a library named hummus. Hummus is a node module for creating, parsing, and modifying PDFs. I relied on its ability to append the pages of an existing PDF to a target PDF. In the code shown below, we create a new PDF file and append the pages of each story PDF to it, which produces a single PDF with the content of all of them.

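const hummus = require('hummus');

function combinePdfs(files) {
    // The writer creates newsletter.pdf and collects appended pages.
    const pdfWriter = hummus.createWriter('newsletter.pdf');
    files
        .filter(file => file.endsWith('.pdf'))
        .forEach(file => {
            // Append every page of this story PDF to the combined PDF.
            pdfWriter.appendPDFPagesFromPDF(file);
        });

    pdfWriter.end();
}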
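
To show how the three steps might fit together, here is a rough sketch of a driver. It is an assumption about the wiring, not the published utility: buildNewsletterPdf and pdfFileName are hypothetical names, and the three step functions are passed in as parameters so the sketch stays self-contained.

```javascript
const path = require('path');

// Hypothetical helper: zero-padded file name so stories keep their newsletter order.
function pdfFileName(index) {
    return String(index + 1).padStart(3, '0') + '.pdf';
}

// Hypothetical driver. The `steps` object is expected to carry functions with the
// same shapes as extractLinksFromUrl, generatePdf, and combinePdfs shown above.
async function buildNewsletterPdf(issueUrl, outputDir, steps) {
    const stories = await steps.extractLinks(issueUrl);
    const files = [];
    // Convert stories one at a time so only one browser runs at once.
    for (let i = 0; i < stories.length; i++) {
        const filename = pdfFileName(i);
        await steps.generatePdf(stories[i].url, outputDir, filename);
        files.push(path.join(outputDir, filename));
    }
    steps.combinePdfs(files);
    return files;
}
```

Processing the stories sequentially is slower than running them in parallel, but it keeps memory use predictable because each call to generatePdf launches its own browser.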

I will publish the node module in the next few days after polishing it a bit. Let me know what features you would like to have in this utility.

 

Solving Pusher HTTPS Issue

If you use Pusher, you might face the following error. It happens when a page served over https loads the Pusher client library over http.

The page at 'https://shekhargulati.com/blog/1#/' was loaded over HTTPS, but ran insecure content from 'http://js.pusher.com/2.1/pusher.min.js': this content should also be loaded over HTTPS.

To fix this error, load Pusher from its Cloudfront CDN using a protocol-relative URL, as shown below. Use the following in your HTML or other template, replacing 2.1 with your own version.

<script src="//d3dy5gmtp8yhk7.cloudfront.net/2.1/pusher.min.js" type="text/javascript"></script>

Learning JavaScript Programming Language Functions Part 3

Intent of My Blog

Today, I watched the third lecture, on JavaScript functions, by Douglas Crockford. This post is the third in this series. Please refer to the first and second posts for the history of JavaScript and its statements.

Functions in JavaScript

Today I watched the third lecture of the series. Check out this lecture.

Key points from the presentation are:
  1. Functions are first-class objects.
  2. Functions can be passed, returned, and shared just like any other value.
  3. Functions inherit from Object and can store name/value pairs.
  4. Functions are container-like objects.
  5. Functions are equivalent to lambdas.
  6. Lambda has enormous expressive power.
  7. Unlike most power constructs, lambda is secure.
  8. The function statement function foo() {} is shorthand for var foo = function foo() { /* statements to execute */ };
  9. In JavaScript, one function can contain other functions.
  10. An inner function has access to the variables and parameters of the function that contains it.
  11. This is called static scoping or lexical scoping.
  12. JavaScript also supports closures.
  13. The scope that an inner function enjoys continues even after the parent function has returned. This is called a closure.
  14. Closures are one of the most powerful features of JavaScript.
  15. JavaScript is the first lambda language to go mainstream.
  16. When a function is called with too many arguments, the extra arguments are ignored.
  17. When a function is called with too few arguments, the missing values are set to undefined.
  18. Functions can be invoked in four ways:
  19. Function form -> functionObject(arguments)
  20. Method form -> thisObject.methodName(arguments) and thisObject["methodName"](arguments)
  21. Constructor form -> new functionObject(arguments)
  22. Apply form -> functionObject.apply(thisObject, arguments)
  23. When a function is invoked, it also gets a special parameter called arguments.
  24. arguments contains all of the arguments from the invocation.
  25. It is an array-like object (it is not a full array).
  26. arguments.length gives the number of arguments passed.
  27. In JavaScript, you can extend the built-in types (like String and Boolean).
  28. Do not use the eval function.
  29. Built-in wrapper types like String, Boolean, and Number are not useful.
  30. Global variables are evil.
  31. Implied globals are evil too.
  32. Always use function scope.
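
A few of the points above (closures, the arguments object, missing and extra arguments, and the four invocation forms) can be sketched in code. This is a small illustration of my own rather than something from the lecture; all the names in it (makeCounter, countArgs, second, point, Point, double) are made up for the example.

```javascript
// Closure: the inner function keeps access to `count` after makeCounter returns.
function makeCounter() {
    var count = 0;
    return function () {
        count += 1;
        return count;
    };
}
var next = makeCounter();
next();
console.log(next()); // 2

// `arguments` is array-like: it has a length but is not a full array.
function countArgs() {
    return arguments.length;
}
console.log(countArgs('a', 'b', 'c')); // 3

// Missing arguments are undefined; extra ones are ignored.
function second(a, b) {
    return b;
}
console.log(second(1));       // undefined
console.log(second(1, 2, 3)); // 2

// The four invocation forms.
var point = {
    x: 10,
    addX: function (n) { return this.x + n; }
};
function Point(x) { this.x = x; }
function double(n) { return n * 2; }

console.log(double(4));                         // function form: 8
console.log(point.addX(1));                     // method form: 11
console.log(point['addX'](2));                  // method form with brackets: 12
console.log(new Point(5).x);                    // constructor form: 5
console.log(point.addX.apply({ x: 100 }, [1])); // apply form: 101
```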

These were some of the points from the talk.

Learning Javascript Programming Language Part 2

Intent of my Blog

This post is the second in the series on my learning the JavaScript programming language. In the first post I discussed the history of the language. To learn JavaScript, I am following Douglas Crockford's videos on YUI Theater and the book “JavaScript: The Definitive Guide 4th Edition”. In this post, I will share some of the things that I learned about the language.

Get Started

Today I watched the second lecture of the series. Check out this lecture.

Key points from the presentation are:
  1. Statements in JavaScript are separated from each other with semicolons. If you place each statement on a separate line, JavaScript allows you to omit the semicolon, but it is a good idea to include it.
  2. Expression statements are expressions which have a side effect.
  3. The statements discussed are: if, switch, while, for, throw, try/catch/finally, function, var, return.
  4. The if statement is the control statement that allows JavaScript to make decisions, or to execute statements conditionally.
  5. The if statement is written like this: if (expression) statement
  6. If the expression is false, null, undefined, 0, "" or NaN, it is converted to false.
  7. The switch statement in JavaScript is different from the switch statement in C, C++, or Java. In those languages, the case expressions must be compile-time constants. They must evaluate to integers or other integral types, and they must all evaluate to the same type.
  8. The JavaScript switch statement is not nearly as efficient as the switch statement in C, C++, and Java. Since the case expressions in those languages are compile-time constants, they never need to be evaluated at runtime as they are in JavaScript. Furthermore, since the case expressions are integral values in C, C++, and Java, the switch statement can often be implemented using a highly efficient “jump table.”
  9. There is a special version of the for loop which exists for objects:
    for (var name in object) {
        if (object.hasOwnProperty(name)) {
            // do something
        }
    }

  10. In the var statement, if no initial value is specified for a variable, the value of the variable is undefined.
  11. The throw statement can throw an Error or an instance of any subclass of Error.
  12. throw can also be used to throw a string that contains an error message, or a numeric value that represents some sort of error code.
  13. Do not use the with statement, because code that uses with is difficult to optimize.
  14. The try/catch/finally statement is JavaScript’s exception-handling mechanism. The try clause of this statement simply defines the block of code whose exceptions are to be handled. The try block is followed by a catch clause, which is a block of statements that are invoked when an exception occurs anywhere within the try block. The catch clause is followed by a finally block containing cleanup code that is guaranteed to be executed, regardless of what happens in the try block. Both the catch and finally blocks are optional, but a try block must be accompanied by at least one of these blocks.
  15. Every function returns a value. A return statement with an expression returns that value; a bare return, or no return statement at all, yields undefined.
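
Several of the statement rules above (falsy values in if, switch with runtime case expressions, for/in with hasOwnProperty, try/catch/finally, and functions without a return value) can be sketched as follows. This is my own illustration rather than code from the lecture, and the function names (classify, ownKeys, risky, noReturn) are made up for the example.

```javascript
// Falsy values: each of these makes an `if` condition false.
var falsy = [false, null, undefined, 0, '', NaN];
var truthyCount = 0;
for (var i = 0; i < falsy.length; i++) {
    if (falsy[i]) {
        truthyCount += 1;
    }
}
console.log(truthyCount); // 0

// switch: case expressions are evaluated at runtime and need not be integers.
function classify(value) {
    var limit = 10;
    switch (value) {
    case 'ten':
    case limit:       // a variable, not a compile-time constant
        return 'ten';
    case limit * 2:
        return 'twenty';
    default:
        return 'other';
    }
}
console.log(classify(10), classify('ten'), classify(20)); // ten ten twenty

// for/in with hasOwnProperty skips inherited properties.
function ownKeys(object) {
    var names = [];
    for (var name in object) {
        if (object.hasOwnProperty(name)) {
            names.push(name);
        }
    }
    return names;
}
var child = Object.create({ inherited: true });
child.own = 1;
console.log(ownKeys(child)); // [ 'own' ]

// try/catch/finally: the finally block runs whether or not an exception was thrown.
function risky(shouldThrow) {
    var log = [];
    try {
        if (shouldThrow) {
            throw new Error('failed');
        }
        log.push('ok');
    } catch (e) {
        log.push('caught ' + e.message);
    } finally {
        log.push('cleanup');
    }
    return log;
}
console.log(risky(true));  // [ 'caught failed', 'cleanup' ]
console.log(risky(false)); // [ 'ok', 'cleanup' ]

// A function with no return statement yields undefined.
function noReturn() {}
console.log(noReturn()); // undefined
```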

These were some of the important points from the talk. I have not covered functions and objects in this post. I will share those in future posts.

Learning JavaScript Programming Language Part 1

Intent of my Blog

So finally, I have decided to learn the “JavaScript Programming Language”, the world's most popular language. In my five years as a software developer, I have always tried to run away from learning and working with JavaScript. But today I have decided to start learning it from the beginning. In this blog series, I will share what I learn.

Get Started

I googled a bit to find the best resources for learning JavaScript and found the video series by Douglas Crockford.

So, in today’s post I will write down the key points from his lecture. I recommend that you listen to this presentation.

Key points from the presentation:
  1. JavaScript is completely independent of Java. It has nothing to do with Java except the resemblance in name. (Please listen to the presentation to find out why.)
  2. It is not just a scripting language but a complete programming language with strong functional features.
  3. It is the most popular programming language.
  4. JavaScript has design errors.
  5. All the books on JavaScript on the market are bad except JavaScript: The Definitive Guide 4th Edition.
  6. The first name of JavaScript was LiveScript, which was created by Netscape.
  7. LiveScript was the first language to be put into the browser.
  8. Netscape and Sun Microsystems joined hands and renamed LiveScript to JavaScript.
  9. Netscape and Sun Microsystems joined hands to beat Microsoft.
  10. Microsoft reverse engineered JavaScript to create a language called JScript.
  11. JavaScript is a small but sophisticated language.
  12. Key ideas:
  • Load and go delivery: programs are delivered to the execution site as source text
  • Loose typing
  • Objects as generic containers
  • Prototypal inheritance: objects can inherit directly from other objects
  • Lambda: functions as first-class objects
  • Linkage through global variables
  13. When you use the parseInt function, always pass a radix: parseInt("08", 10)
  14. JavaScript is case sensitive.
  15. JavaScript syntactically belongs to the C family.
  16. == and != do type coercion.
  17. === and !== are faster and more reliable.
  18. Bitwise operators are slower because JavaScript does not have integers, so 64-bit floats are first converted to 32-bit integers and then converted back to 64-bit floats.
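
A few of these points are easy to check in code. The snippet below is my own small sketch, not something from the lecture, and the variable names (animal, dog) are made up for the example.

```javascript
// Always pass a radix to parseInt so a leading zero is never read as octal.
console.log(parseInt('08', 10)); // 8

// == and != coerce types before comparing; === and !== do not.
console.log(0 == '');   // true
console.log(0 === '');  // false
console.log('1' == 1);  // true
console.log('1' === 1); // false

// Bitwise operators convert the number to a 32-bit integer first.
console.log(3.7 | 0);             // 3
console.log(Math.pow(2, 32) | 0); // 0

// Prototypal inheritance: an object inherits directly from another object.
var animal = { legs: 4 };
var dog = Object.create(animal);
dog.name = 'Rex';
console.log(dog.legs); // 4
```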

These were some of the important points from the talk.
I will share what I learn about JavaScript as I move along.