Today, I was working on an application that required me to extract the main content html for a web page. This is called article extraction. Most of the time you want to extract the text of the article but I wanted to extract HTML of the main content. For example, if you are reading following WashingtonPost article then I want to extract the main HTML content on the left. I don’t want sidebar HTML containing ads or other information.
Continue reading “Building an Article Extraction Python API with newspaper3k and flask”