
Things to avoid while doing Web scraping

Python is one of the most popular languages in the developer community thanks to its versatility and ease of use. One area where Python is used heavily is web scraping, because of the large number of libraries available for almost any task: whether you need to build a data-driven application, a monitoring system, or a comparison tool, Python provides powerful libraries for it.

However, web scraping has its own difficulties, one of which is the risk of being blocked by the websites you scrape.

There is a misconception among developers that web scraping is illegal, but the truth is that it is perfectly legal unless you’re trying to access non-public data (data that is not accessible to the public, such as login credentials).

When you scrape simple websites like Wikipedia or other small sites, you might not face any issues, but when you try to scrape large websites like Amazon, Flipkart, or even Google, you may find your requests getting ignored or even your IP getting blocked.

However, not being flagged as a scraper is getting harder and harder as anti-bot technologies become more sophisticated and more widely used.

You can be detected by:

  • Sending too many requests at a time
  • Cookies/sessions
  • Headers
  • IP address
  • Browser fingerprints

In this blog we’re going to share common ways to avoid being detected as a scraper and how to optimise your scraper so that you get the website’s HTML without being blocked.

SENDING TOO MANY REQUESTS AT A TIME

In this case, you have to wait a bit before making another request. For security reasons, the server may not tell you how long that wait should be.

I will explain what the 429 error means, how a developer might have implemented it, and what you can do to resolve it.

The 429 error is an HTTP status code returned when a client has exceeded the number of requests it is allowed to send to a resource within a given period of time.

This error might be:

  • Error 429
  • 429 Too many requests
  • 429 (Too many requests)

What you can do to resolve the 429 error:

You can use the time.sleep() function to add an interval between each request your code sends to the server.

You should also use the random module to randomize that interval, so the traffic pattern looks less like a bot’s.

Step 1: First, import the required modules.

import time
import random
import requests

Step 2: Here we request the same site every 1 to 3 minutes.

while True:
    # Wait a random interval of 1 to 3 minutes between requests
    wait = random.randint(60, 60 * 3)
    response = requests.get('https://www.example.com')
    time.sleep(wait)
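
When the server does tell you how long to wait, it usually does so through the Retry-After response header sent with the 429. Below is a minimal sketch (example.com is just a placeholder URL) that honours that header when it is present and falls back to a random delay otherwise; note that Retry-After can also contain a date rather than a number of seconds, which this sketch does not handle.

import time
import random
import requests

def polite_get(url):
    response = requests.get(url)
    if response.status_code == 429:
        # Honour the Retry-After header if the server sends one,
        # otherwise fall back to a random 1 to 3 minute pause.
        retry_after = response.headers.get("Retry-After")
        wait = int(retry_after) if retry_after else random.randint(60, 60 * 3)
        time.sleep(wait)
        response = requests.get(url)
    return response

response = polite_get('https://www.example.com')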

COOKIES/SESSIONS

A session is used to store information on the server temporarily so that it can be used across different pages of the website; it lasts for the duration of a user’s visit.

A cookie is a small text file that is saved on the user’s computer. The maximum file size for a cookie is 4KB. It is also known as an HTTP cookie, a web cookie, or an internet cookie. When a user first visits a website, the site sends data packets to the user’s computer in the form of a cookie.

import requests

# First create a dictionary of cookies using the syntax {key: value},
# where key is the cookie name and value is the cookie value.
cookies_dict = {"my_cookie": "cookie_value"}

# Call requests.get() to send cookies_dict along with the request to the url.
response = requests.get("https://www.example.com", cookies=cookies_dict)
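
If you need to keep cookies and other state across several requests, the requests library also provides a Session object: cookies set by the server are stored on the session and sent back automatically with every later request. A minimal sketch (the URLs are placeholders):

import requests

session = requests.Session()

# Cookies set by the first response are stored on the session ...
first = session.get("https://www.example.com/")

# ... and sent back automatically with every later request in the same session.
second = session.get("https://www.example.com/account")
print(session.cookies.get_dict())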


HTTP HEADERS

HTTP headers allow the client and server to pass additional information while sending an HTTP request or receiving a subsequent response.

These headers are case-insensitive, meaning that the header of ‘User-Agent’ could also be written as ‘user-agent’.

Additionally, HTTP headers in the requests library are passed in using Python dictionaries, where the key represents the header name and the value represents the header value.

The chance that the server will block our scraper goes down if we send realistic HTTP headers (especially a browser-like User-Agent) with our requests.

To pass HTTP headers into a GET request using Python, you can use the ‘headers=’ parameter in the ‘.get()’ function. The parameter accepts a Python dictionary of key-value pairs, where the key represents the header type and the value is the header value.

Let’s see the code:

import requests

url = 'https://www.example.com/'
headers = {'Content-Type': 'text/html'}
print(requests.get(url, headers=headers))
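
In practice, the header that matters most for avoiding blocks is usually User-Agent, since the default value sent by requests ('python-requests/x.y.z') immediately identifies the client as a script. Here is a minimal sketch that sends a browser-like User-Agent instead; the exact string is only an example of a desktop Chrome identity and will go stale over time:

import requests

url = 'https://www.example.com/'
headers = {
    # A browser-like User-Agent makes the request look less like a script.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get(url, headers=headers)
print(response.status_code)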

IP ADDRESS

When building a crawler for URL and data extraction, the simplest way for a defensive system to prevent access is to ban IPs: if a high number of requests from the same IP hit the server in a short time, they’ll mark that address.

The easiest way to avoid that is to use different IP addresses. But you cannot easily change your IP, and it’s almost impossible if your scraper runs on a server. So in order to rotate IPs, you need to perform your requests via proxy servers. These will keep your original requests unmodified, but the target server will see their IP, not yours.

Let’s see how to rotate a proxy in python.

You’ll need Python 3 on your device. Some systems have it pre-installed. After that, install the necessary library by running pip install.

pip install requests

Step 1: There are several lists of free proxies online. For the demo, grab one of those and save its content (just the URLs) in a text file (rotating_proxies_list.txt).

Free proxies aren’t reliable, and the ones below probably won’t work for you; they’re usually short-lived. For production scraping projects, we recommend using datacenter or residential proxies.

Here are some sample proxies to save in the rotating_proxies_list.txt file.

167.71.230.124:8080
192.155.107.211:1080
77.238.79.111:8080
167.71.5.83:3128
195.189.123.213:3128
8.210.83.33:80
80.48.119.28:8080
152.0.209.175:8080
187.217.54.84:80
169.57.1.85:8123

Then, we’ll read that file and create an array with all proxies. Read it, strip empty spaces, and split each line. Be careful when saving the document since we won’t perform any sanity checks for valid IP:port strings. We’ll keep it simple.

Let’s see the code:

import requests

proxies_list = open("rotating_proxies_list.txt", "r").read().strip().split("\n")

def get(url, proxy):
    try:
        # Send proxy requests to the final URL
        response = requests.get(url, proxies={'http': f"http://{proxy}"}, timeout=30)
        print(response.status_code, response.text)
    except Exception as e:
        print(e)

def check_proxies():
    proxy = proxies_list.pop()
    get("http://ident.me/", proxy)

check_proxies()
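
Note that check_proxies() above only uses the last proxy in the file. To actually rotate, pick a different proxy for each request and move on to the next one when a proxy fails. Here is a minimal sketch along those lines, reusing the same proxies_list; http://ident.me simply echoes the IP address the server sees, which makes it easy to verify that the rotation works:

import random
import requests

proxies_list = open("rotating_proxies_list.txt", "r").read().strip().split("\n")

def get_with_rotation(url):
    # Try the proxies in random order until one of them answers.
    for proxy in random.sample(proxies_list, len(proxies_list)):
        try:
            response = requests.get(url, proxies={'http': f"http://{proxy}"}, timeout=30)
            print(proxy, "->", response.status_code, response.text)
            return response
        except Exception as e:
            print(proxy, "failed:", e)
    return None

get_with_rotation("http://ident.me/")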

BROWSER FINGERPRINTS

Increasingly, to deal with websites that use anti-bot technologies, many developers are turning to headless browsers like Puppeteer, Playwright, or Selenium to avoid getting blocked when scraping a website.

Using a headless browser does make your requests look more like a real user than using an HTTP client; however, headless browsers aren’t a magic bullet, and they open up a Pandora’s box of ways for websites to test whether you are a scraper or a real user.

Modern anti-bot technologies use browser fingerprinting, can detect browser automation leaks, and integrate honeytraps and other challenges into the page that your scrapers could fail.

Here are some of the major issues:

1. Fixing Browser Leaks

Browsers provide information about themselves in the JavaScript execution context, which the client (i.e. the website) can access to verify that the browser is in fact a real user and not a bot.

By default, most headless browsers leak information that tells the website that the browser is automated and not a real user. To avoid being blocked you need to patch these leaks and fortify your browser fingerprint so that your scraper isn’t detected.

In our guide to fortifying your headless browser we go through in detail what some of these fingerprint leaks are, and how to patch them.

However, when you are using a headless browser for web scraping you should always use the stealth versions, as they often have the most common leaks fixed.

  • puppeteer-stealth
  • playwright-stealth
  • selenium-stealth
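
For example, with Selenium the selenium-stealth package patches several of the most common automation leaks (navigator.webdriver, missing plugins, WebGL vendor strings, and so on). A minimal sketch, assuming Chrome, chromedriver and selenium-stealth are installed; the keyword arguments describe the identity you want the browser to present and follow the package’s documented usage:

from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Patch common automation leaks before visiting any page.
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://www.example.com/")
print(driver.page_source[:200])
driver.quit()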

2. Consistent Identity

A common issue that many developers overlook is making sure the identity you are displaying in your headers, user-agent, browser, server and proxy are all consistent and match each other.

For example, if you are using a headless browser, you need to make sure the user-agent string you define matches the browser version you are actually running. Likewise, if you run your scraper on a Linux machine but your user-agent says it is a Windows machine, that mismatch gives you away.

Inconsistencies like this won’t happen for real users visiting a website, so if you don’t make sure you use a consistent identity with every request then you are likely to get blocked.
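
As a quick illustration, here is a hedged sketch of a header set where the User-Agent, the client-hint platform header and the language all describe the same identity (desktop Chrome on Linux); the exact strings are examples and should be replaced with values that match the browser and operating system your scraper actually uses:

import requests

# Every value below describes the same identity: Chrome on desktop Linux.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36',
    'Sec-CH-UA-Platform': '"Linux"',
    'Accept-Language': 'en-US,en;q=0.9',
}

response = requests.get('https://www.example.com/', headers=headers)
print(response.status_code)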

Conclusion

Rate limiting your requests, sending realistic cookies and headers, rotating proxies, and fortifying your headless browser go a long way towards keeping your scraper from being detected and blocked. If you liked the article and found it helpful, feel free to share it and leave a reply below.
