The case of the infinite poop emoji crawler trap
08 Jan 2019
I noticed that my crawler was getting stuck on a site even though I had set a timeout on the GET request. Little did I know that the site was full of poop.
What happened
My worker on Heroku was getting stuck trying to get a response from a URL. It turned out the website was a trap that prints poop emojis forever. Touché, RobPike, touché.
How I fixed it
Using the built-in read_timeout option on Net::HTTP wasn’t doing the trick, because the request never actually timed out: read_timeout only fires when the socket sits idle, and this page never stops sending data.
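Here’s a sketch of that first attempt (the URL is a placeholder), to show why it fails on this kind of page:

```ruby
require "net/http"

# read_timeout counts seconds of socket *idle* time, so a server
# that keeps streaming bytes never trips it.
uri = URI("http://example.com/")
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 5
response = http.request(Net::HTTP::Get.new(uri)) # hangs on an endless stream
```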
I ended up simply wrapping my get_response method in a Ruby Timeout block.
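A minimal sketch of that wrap, using a helper name of my own (fetch_with_timeout): Timeout caps wall-clock time, so the request gets aborted after five seconds no matter how busily the server keeps streaming.

```ruby
require "net/http"
require "timeout"

def fetch_with_timeout(url, seconds: 5)
  Timeout.timeout(seconds) do
    Net::HTTP.get_response(URI(url))
  end
rescue Timeout::Error
  nil # treat a never-ending page as a failed fetch
end
```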
How I found the solution
- Was having trouble getting the final redirect of a URL
- Found out that the get_response part never ended or timed out
- Opened the URL in the browser to see what was happening. Turns out the page constantly prints the poop emoji, so the crawler never completes. Genius.
- Found this blog post showing how to set a GET timeout
- Only problem is, it doesn’t actually time out. The page just deliberately renders indefinitely.
- Used Timeout::timeout(5) instead, but it immediately threw an “end of file reached” exception
- Looks like redirects to HTTPS need to have http.use_ssl = (uri.scheme == "https") (Source)
- Timeout worked! (A sketch of the final request logic follows this list.)
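Here’s roughly how the final version fits together, with hypothetical names of my own (final_response). Without use_ssl, following a redirect onto an https:// URL raises EOFError (“end of file reached”), because Net::HTTP speaks plain HTTP to port 443 unless told otherwise.

```ruby
require "net/http"
require "timeout"

def final_response(url, limit: 5)
  raise "too many redirects" if limit.zero?

  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https") # the line that fixed the EOFError

  # Cap wall-clock time so an endlessly streaming page can't stall us.
  response = Timeout.timeout(5) { http.request(Net::HTTP::Get.new(uri)) }

  case response
  when Net::HTTPRedirection
    final_response(response["location"], limit: limit - 1)
  else
    response
  end
end
```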