In this post, I will show you how to write a simple Python script that notifies you whenever a new element is added to a given URL. All you need is the URL of the target website and an internet connection.
Procedure
We will use Beautiful Soup for this project.
Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML, which is useful for web scraping.
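As a quick illustration of why this is convenient, the short sketch below (the HTML fragment is made up for demonstration) parses deliberately malformed markup and still extracts the text cleanly:
from bs4 import BeautifulSoup

# A deliberately malformed fragment: the <p> and <b> tags are never closed
broken_html = "<p>Hello <b>world"
soup = BeautifulSoup(broken_html, "html.parser")

# Beautiful Soup repairs the tree, so the text is still recoverable
print(soup.p.get_text())  # prints "Hello world"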
Import necessary libraries
import requests
from bs4 import BeautifulSoup
import schedule
import time
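Note that requests, bs4 (installed as the beautifulsoup4 package) and schedule are third-party libraries; if they are not already on your system, they can typically be installed with pip install requests beautifulsoup4 schedule.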
def extract_new_elements(url):
    global previous_elements  # updated below so only genuinely new items are reported
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        current_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
        new_elements = current_elements - previous_elements
        if new_elements:
            print("New elements found:", new_elements)
        else:
            print("No new elements found.")
        previous_elements = current_elements  # remember this snapshot for the next check
    else:
        print("Failed to retrieve data from the URL.")
Importing Libraries: The function depends on the requests library to make HTTP requests and the BeautifulSoup class from the bs4 module to parse HTML content.
HTTP Request: It sends an HTTP GET request to the URL provided as the url parameter.
Response Handling: It checks whether the response status code is 200, which indicates that the request was successful and the webpage is available.
Parsing HTML: If the response is successful, the function uses BeautifulSoup to parse the HTML content of the webpage.
Finding Current Elements: It then finds a specific <a> tag with a class attribute value of 'vehicles-export'. This tag likely contains data about vehicles. The function extracts the value of the 'data-vehicles' attribute from this tag and splits it into a set of elements. These elements represent the current data about vehicles available on the webpage.
Finding New Elements: It calculates the set difference between the current_elements set and a set named previous_elements. This subtraction identifies any elements that are in current_elements but not in previous_elements, effectively finding new elements since the last time the function was called. After the comparison, previous_elements is updated to the current snapshot so that the next run only reports genuinely new items.
Printing Results: If new elements are found, it prints a message indicating their presence along with the new elements themselves. If no new elements are found, it prints a message saying so.
Error Handling: If the HTTP request fails (e.g., due to a connection issue or an invalid URL), it prints an error message.
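To see the detection logic in isolation, without hitting a live site, here is a minimal sketch that runs the same set-difference comparison against inline HTML strings; the markup and vehicle names are made up for demonstration:
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure the script expects
html_before = '<a class="vehicles-export" data-vehicles="car1,car2"></a>'
html_after = '<a class="vehicles-export" data-vehicles="car1,car2,car3"></a>'

def snapshot(html):
    soup = BeautifulSoup(html, 'html.parser')
    return set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))

previous_elements = snapshot(html_before)
current_elements = snapshot(html_after)
print(current_elements - previous_elements)  # {'car3'}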
def job():
    print("Checking for new elements...")
    extract_new_elements(url)
job() is a function that does one simple task: it wraps extract_new_elements(), so whenever job is executed, extract_new_elements() runs as well. This wrapper becomes useful later, when we hand it to the scheduler.
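As an aside, schedule's do() method also accepts arguments to forward to the job function, so the wrapper could in principle be replaced by a direct call; keeping job() is still handy for the "Checking..." log line:
# Equivalent scheduling without the wrapper (schedule forwards the argument)
schedule.every(15).seconds.do(extract_new_elements, url)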
previous_elements = set()
We are using a set data structure here to store all the elements currently on the URL. The point of using a set is that if an element appears again after it has already been seen, there is no need to report it a second time; a set does exactly this, since it stores no duplicate elements and makes comparing two snapshots straightforward.
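As a quick refresher on why sets fit this job: duplicates collapse automatically, and the difference operator yields exactly the newly added items.
seen = set(["car1", "car2", "car1"])  # duplicates collapse automatically
print(seen)                            # {'car1', 'car2'}
current = {"car1", "car2", "car3"}
print(current - seen)                  # {'car3'} -- only the new item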
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    previous_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
else:
    print("Failed to retrieve initial data from the URL.")
This code fetches data from the specified URL using requests.get(). If the HTTP response status code is 200 (indicating success), it parses the HTML content using BeautifulSoup. Specifically, it searches for an <a> tag with the class attribute set to 'vehicles-export' and extracts the value of the 'data-vehicles' attribute. This value is split into elements and stored in a set named previous_elements, which serves as the initial snapshot.
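One caveat: soup.find() returns None when no matching tag exists, so the attribute lookup above would raise a TypeError if the page layout changes. A slightly more defensive variant of the same snippet (same logic, just guarded) might look like this:
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    tag = soup.find('a', class_='vehicles-export')
    if tag is not None and tag.has_attr('data-vehicles'):
        previous_elements = set(tag['data-vehicles'].split(','))
    else:
        print("Expected 'vehicles-export' tag not found on the page.")
else:
    print("Failed to retrieve initial data from the URL.")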
schedule.every(15).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
This is the snippet that triggers the whole process: it schedules job to run every 15 seconds and keeps the script alive to run pending jobs. It will not stop executing until the user interrupts it manually, for example by pressing Ctrl + C in the terminal.
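Fifteen seconds is quite aggressive for polling a live website; the schedule library makes it easy to pick a gentler interval. For example:
schedule.every(10).minutes.do(job)        # check every 10 minutes
schedule.every().hour.do(job)             # check once an hour
schedule.every().day.at("09:00").do(job)  # check daily at 09:00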
The final script should look something like this:
import requests
from bs4 import BeautifulSoup
import schedule
import time

# Function to extract new elements from the URL
def extract_new_elements(url):
    global previous_elements  # updated below so only genuinely new items are reported
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        current_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
        new_elements = current_elements - previous_elements
        if new_elements:
            print("New elements found:", new_elements)
            # Save new elements in a variable or perform any other desired action
            # For example:
            # new_elements_variable = list(new_elements)
        else:
            print("No new elements found.")
        previous_elements = current_elements  # remember this snapshot for the next check
    else:
        print("Failed to retrieve data from the URL.")

# Function to be scheduled to run every 15 seconds
def job():
    print("Checking for new elements...")
    extract_new_elements(url)

# URL to scrape
url = "https://remarketing.jyskefinans.dk/cars/"

# Variable to store previously seen elements
previous_elements = set()

# Initial extraction of elements
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    previous_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
else:
    print("Failed to retrieve initial data from the URL.")

# Schedule job to run every 15 seconds
schedule.every(15).seconds.do(job)

# Run the scheduler
while True:
    schedule.run_pending()
    time.sleep(1)
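The script currently "notifies" by printing to the terminal. If you want a persistent record, one simple extension is to append each discovery to a log file. This is a minimal sketch using only the standard library; the notify helper and the file name are my own additions, not part of the original script:
from datetime import datetime

# Hypothetical helper: append newly found elements to a log file
def notify(new_elements, log_path="new_elements.log"):
    timestamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log_file:
        for element in sorted(new_elements):
            log_file.write(f"{timestamp}\t{element}\n")

# Inside extract_new_elements(), you could call notify(new_elements)
# right after printing them.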
The official documentation of the Beautiful Soup project (version 4, which this script uses) can be found here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/