Day 18: BoilerPipe–Article Extraction for Java Developers

Today, for my 30-day challenge, I decided to learn how to extract text and images from web links using the Java programming language. This is a common requirement for content discovery websites like Prismatic. In this blog post, we will learn how to use a Java library called boilerpipe to accomplish this task. Read the full blog post here: https://www.openshift.com/blogs/day-18-boilerpipe-article-extraction-for-java-developers

Day 16: Goose Extractor–An Article Extractor That Just Works

Today, for my 30-day challenge, I decided to learn how to do article extraction using the Python programming language. I have been interested in article extraction for a few months, ever since I wanted to write a Prismatic clone. Prismatic creates a news feed based on user interests. Extracting an article's main content, images, and other meta information is a common requirement for content discovery websites like Prismatic. In this blog post, we will learn how to use a Python package called goose-extractor to accomplish this task. We will first cover some basics, and then develop a simple Flask application that uses the Goose Extractor API. Read the full article here: https://www.openshift.com/blogs/day-16-goose-extractor-an-article-extractor-that-just-works