The case of the infinite poop emoji crawler trap

I noticed that my crawler was getting stuck on a site even though I had set a timeout on the GET request. Little did I know that the site was full of poop.

What happened

My worker on Heroku was getting stuck waiting for a response from a URL. It turns out the website was a trap that would print poop emojis forever. Touché RobePike, touché.

How I fixed it

Using the built-in read_timeout on Net::HTTP wasn’t doing the trick because the request wasn’t actually timing out: read_timeout only fires when no data arrives for the given number of seconds, and this page streamed data continuously. The request would simply take forever.

I ended up wrapping my get_response method in a Ruby Timeout block, which caps total wall-clock time regardless of whether data is still arriving.
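A minimal sketch of the idea (the helper name and the endless loop are mine; the loop stands in for a request to the trap page, which keeps producing output and so never trips read_timeout):

```ruby
require "timeout"

# Cap a block at `seconds` of wall-clock time, no matter what it does.
def capped_fetch(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :gave_up   # the trap page never finishes, so we bail out
end

# An endless loop simulates the infinitely rendering page:
result = capped_fetch(0.2) { loop { sleep 0.01 } }
puts result  # => gave_up
```

Unlike read_timeout, which resets every time a chunk of data arrives, Timeout.timeout measures the whole block, so a page that drips out poop emojis forever still gets cut off.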

How I found the solution

I was having trouble getting the final redirect of a URL

  • Found out that the get_response part never ended or timed out
    • Opened the URL in the browser to see what was happening. Turns out, the page constantly prints the poop emoji causing the crawler to never complete. Genius.
      • Found this blog post showing how to set a get timeout
        • Only problem is, it doesn’t actually time out. The page just deliberately renders indefinitely.
          • Used Timeout.timeout(5) instead, but it immediately threw an “end of file reached” exception
            • Looks like requests redirected to HTTPS need http.use_ssl = (uri.scheme == "https") (Source)
              • Timeout worked!
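Putting those steps together, the final shape of the fix looks roughly like this. The method name and parameters are my own sketch, not the original code; the two load-bearing lines are use_ssl and the Timeout block:

```ruby
require "net/http"
require "uri"
require "timeout"

# Fetch a URL with a hard wall-clock cap (sketch; names are hypothetical).
def fetch_response(url, cap: 5)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")  # avoids "end of file reached" after an http -> https redirect
  Timeout.timeout(cap) do                 # hard limit, unlike read_timeout, which resets on each chunk
    http.request(Net::HTTP::Get.new(uri))
  end
rescue Timeout::Error
  nil                                     # give up on trap pages that stream forever
end
```

With this in place, a normal page returns its response within the cap, and the poop-emoji trap just comes back as nil.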