The case of the infinite poop emoji crawler trap

I noticed that my crawler was getting stuck on a site even though I had set a timeout on the GET request. Little did I know that the site was full of poop.

What happened

My worker on Heroku was getting stuck waiting for a response from a URL. It turns out the website was a trap that would print poop emojis forever. Touché RobePike, touché.

How I fixed it

Using the built-in read_timeout on Net::HTTP wasn’t doing the trick because the request wasn’t actually timing out: read_timeout only fires when no data arrives for the given number of seconds, and this page streamed data continuously. The request would simply take forever.

I ended up wrapping my get_response method in a Ruby Timeout block, which caps total wall-clock time regardless of whether data is still arriving.
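A minimal sketch of the idea (the helper name and the endless loop are mine; the loop stands in for a request to the trap page, which keeps producing output and so never trips read_timeout):

```ruby
require "timeout"

# Cap a block at `seconds` of wall-clock time, no matter what it does.
def capped_fetch(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :gave_up   # the trap page never finishes, so we bail out
end

# An endless loop simulates the infinitely rendering page:
result = capped_fetch(0.2) { loop { sleep 0.01 } }
puts result  # => gave_up
```

Unlike read_timeout, which resets every time a chunk of data arrives, Timeout.timeout measures the whole block, so a page that drips out poop emojis forever still gets cut off.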

How I found the solution

I was having trouble getting the final redirect of a URL

  • Found out that the get_response part never ended or timed out
    • Opened the URL in the browser to see what was happening. Turns out, the page constantly prints the poop emoji causing the crawler to never complete. Genius.
      • Found this blog post showing how to set a get timeout
        • Only problem is, it doesn’t actually time out. The page just deliberately renders indefinitely.
          • Used Timeout.timeout(5) instead, but it immediately threw an “end of file reached” exception
            • Looks like requests redirected to HTTPS need http.use_ssl = (uri.scheme == "https") (Source)
              • Timeout worked!
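Putting those steps together, the final shape of the fix looks roughly like this. The method name and parameters are my own sketch, not the original code; the two load-bearing lines are use_ssl and the Timeout block:

```ruby
require "net/http"
require "uri"
require "timeout"

# Fetch a URL with a hard wall-clock cap (sketch; names are hypothetical).
def fetch_response(url, cap: 5)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")  # avoids "end of file reached" after an http -> https redirect
  Timeout.timeout(cap) do                 # hard limit, unlike read_timeout, which resets on each chunk
    http.request(Net::HTTP::Get.new(uri))
  end
rescue Timeout::Error
  nil                                     # give up on trap pages that stream forever
end
```

With this in place, a normal page returns its response within the cap, and the poop-emoji trap just comes back as nil.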