Today, I was working on an application that required me to extract the main content HTML of a web page. This task is called article extraction. Most of the time you want the text of the article, but I wanted the HTML of the main content. For example, if you are reading the following Washington Post article, I want to extract the main HTML content on the left. I don’t want the sidebar HTML containing ads or other information.
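To make the idea concrete, here is a minimal sketch of main-content HTML extraction using only the Python standard library. It is not the approach or library used in the post; it is a deliberately crude heuristic that assumes the main content is the container element (`article`, `main`, `section`, `div`, or `td`) holding the most paragraph text directly inside it. The names `MainContentFinder` and `extract_main_html` are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements treated as candidate "main content" containers (an assumption).
CONTAINERS = {"article", "main", "section", "div", "td"}


class MainContentFinder(HTMLParser):
    """Keeps the outer HTML of the container with the most <p> text."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so the output stays valid HTML.
        super().__init__(convert_charrefs=False)
        self.open = []        # stack of [tag, html_parts, paragraph_text_length]
        self.p_depth = 0      # > 0 while inside a <p> element
        self.best = (0, "")   # (score, outer HTML) of the best container so far

    def _emit(self, raw):
        # Every open container accumulates the raw markup it encloses.
        for container in self.open:
            container[1].append(raw)

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.open.append([tag, [], 0])
        if tag == "p":
            self.p_depth += 1
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "p" and self.p_depth:
            self.p_depth -= 1
        if tag in CONTAINERS and self.open and self.open[-1][0] == tag:
            _, parts, score = self.open.pop()
            if score > self.best[0]:
                self.best = (score, "".join(parts))

    def handle_data(self, data):
        self._emit(data)
        # Paragraph text is credited to the innermost open container only.
        if self.p_depth and self.open:
            self.open[-1][2] += len(data.strip())

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def extract_main_html(page_html):
    """Return the outer HTML of the highest-scoring container, or ''."""
    finder = MainContentFinder()
    finder.feed(page_html)
    finder.close()
    return finder.best[1]


if __name__ == "__main__":
    sample = (
        '<html><body>'
        '<div id="sidebar"><p>Ad</p><a href="/x">More</a></div>'
        '<div id="story"><h1>Title</h1>'
        '<p>First long paragraph of the article body.</p></div>'
        '</body></html>'
    )
    print(extract_main_html(sample))
```

A heuristic like this breaks down on real pages with deeply nested or malformed markup, which is why dedicated extraction libraries exist, but it shows the core idea: score candidate blocks by how much article-like text they contain and keep the HTML of the winner.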
Today for my 30 day challenge, I decided to learn how to do text and image extraction from web links using the Java programming language. This is a very common requirement for content discovery websites like Prismatic. In this blog post, we will learn how to use a Java library called boilerpipe to accomplish this task. Read the full blog post here: https://www.openshift.com/blogs/day-18-boilerpipe-article-extraction-for-java-developers
Today for my 30 day challenge, I decided to learn how to do article extraction using the Python programming language. I have been interested in article extraction for a few months, ever since I wanted to write a Prismatic clone. Prismatic creates a news feed based on user interests. Extracting an article’s main content, images, and other meta information is a very common requirement for content discovery websites like Prismatic. In this blog post, we will learn how to use a Python package called goose-extractor to accomplish this task. We will first cover some basics, and then we will develop a simple Flask application that uses the Goose Extractor API. Read the full article here: https://www.openshift.com/blogs/day-16-goose-extractor-an-article-extractor-that-just-works