Playing with htmlq, awk, and sed


Last week I discovered htmlq, a CLI tool that extracts content from HTML using CSS selectors. It does for HTML what jq, the popular command-line JSON processor, does for JSON.

The best way to learn a tool is to use it for something useful. In this short post, I show how I used htmlq to extract content from my GitHub profile https://github.com/shekhargulati?tab=repositories.

Finding the names of all the repositories on the first page

curl --silent https://github.com/shekhargulati\?tab\=repositories \
| htmlq 'a[itemprop="name codeRepository"]' \
| htmlq --text --ignore-whitespace \
| awk '{$1=$1};1' \
| sed '/^$/d'

It lists the 30 most recently updated repositories.

useful-microservices-reading-list
python-flask-docker-hello-world
textract
cookiecutter-spring-boot-ms-template
useful-twitter-threads
software-architecture-document-template
awesome-multitenancy
flask-login-example
project-wiki-template
okrapp
ziglings
timeflake-java
shekhargulati
30-seconds-of-java
99-problems
useful-tech-radars
first-git-commit
spring-boot-maven-angular-starter
boot-angular-pagination-example-app
covid-19-resources
must-read-resources-for-java-developers
strman-java
funwithlambdas
spring-boot-failure-analyzer-example
java8-the-missing-tutorial
image-resolver
fs-101-homework
copy-as-plain-text-chrome-extension
opentracing-microservices-example
k8s-workshop
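
The last two stages of the pipeline above only tidy up the text that htmlq prints. The sketch below isolates them so they can be tried without a network call. The name clean_lines and the sample input are assumptions made for illustration; they are not part of htmlq.

```shell
# clean_lines is a hypothetical helper wrapping the two cleanup stages.
# awk '{$1=$1};1' rebuilds each line from its fields, which trims leading and
# trailing whitespace (and collapses inner runs of whitespace to one space).
# sed '/^$/d' then deletes the lines that are now empty.
clean_lines() {
  awk '{$1=$1};1' | sed '/^$/d'
}

# Made-up input imitating padded text output.
printf '   textract  \n\n\t okrapp\n   \n ziglings \n' | clean_lines
```

Run on the sample, this should print textract, okrapp, and ziglings, one per line, with no padding and no blank lines.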

Sum all the stars on the first page

curl --silent https://github.com/shekhargulati\?tab\=repositories \
| htmlq '.f6.color-text-secondary.mt-2' \
| htmlq 'a[href*=stargazers]' --text \
| awk '{$1=$1};1' \
| sed '/^$/d' \
| sed 's/,//g' \
| awk '{s+=$1} END {print s}'

The output is 7680.
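
The summing step can be checked on its own with made-up numbers. This is a minimal sketch, assuming the counts arrive one per line and may contain thousands separators; sum_counts is a hypothetical name, and printing s+0 instead of s is a small change from the pipeline above so that empty input prints 0 rather than a blank line.

```shell
# sum_counts is a hypothetical helper: reads one count per line,
# removes thousands separators, and prints the total.
sum_counts() {
  sed 's/,//g' | awk '{s+=$1} END {print s+0}'
}

# Made-up sample counts, not real star numbers: 1234 + 56 + 7.
printf '1,234\n56\n7\n' | sum_counts
```

For the sample input this should print 1297. Without the comma removal, awk would read "1,234" as 1 and the total would be wrong.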

Sum all the forks on the first page

curl --silent https://github.com/shekhargulati\?tab\=repositories \
| htmlq '.f6.color-text-secondary.mt-2' \
| htmlq 'a[href*=members]' --text \
| awk '{$1=$1};1' \
| sed '/^$/d' \
| sed 's/,//g' \
| awk '{s+=$1} END {print s}'

The output is 2786.

List all the unique programming languages on the first page

curl --silent https://github.com/shekhargulati\?tab\=repositories \
| htmlq '.f6.color-text-secondary.mt-2' \
| htmlq 'span[itemprop=programmingLanguage]' --text \
| awk '{$1=$1};1' \
| sed '/^$/d' \
| sort -u

The output is shown below.

HTML
Java
JavaScript
Python
Rust
TypeScript
Zig
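
A small variation on the final sort -u stage counts how many repositories use each language instead of only deduplicating. This is a sketch on made-up input, not output from my profile; count_languages is a hypothetical name, and it expects one language per line, as the cleanup stages above produce.

```shell
# count_languages is a hypothetical helper: reads one language per line and
# prints "count language", most frequent first.
# uniq -c needs sorted input; the final awk trims the padding uniq -c adds.
count_languages() {
  sort | uniq -c | sort -rn | awk '{$1=$1};1'
}

# Made-up sample input.
printf 'Java\nPython\nJava\nZig\nJava\nPython\n' | count_languages
```

On the sample this should print "3 Java", "2 Python", and "1 Zig". The order of languages with equal counts depends on sort's tie-breaking, so treat it as unspecified.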

List all pinned repos

curl --silent https://github.com/shekhargulati | htmlq "span.repo" --text

The output is shown below.

52-technologies-in-2016
99-problems
java8-the-missing-tutorial
30-seconds-of-java
hands-on-serverless-guide
useful-microservices-reading-list
