In this post, I will show you how to write a simple Python script that notifies you whenever a new element is added to a given URL. All you need is the URL of the target website and an internet connection.
Procedure
We will use Beautiful Soup for this project.
Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML, which is useful for web scraping.
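As a quick illustration of why this is convenient, the short sketch below (the HTML fragment is made up for demonstration) parses deliberately malformed markup and still extracts the text cleanly:
from bs4 import BeautifulSoup

# A deliberately malformed fragment: the <p> and <b> tags are never closed
broken_html = "<p>Hello <b>world"
soup = BeautifulSoup(broken_html, "html.parser")

# Beautiful Soup repairs the tree, so the text is still recoverable
print(soup.p.get_text())  # prints "Hello world"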
Import necessary libraries
import requests
from bs4 import BeautifulSoup
import schedule
import time
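Note that requests, bs4 (installed as the beautifulsoup4 package) and schedule are third-party libraries; if they are not already on your system, they can typically be installed with pip install requests beautifulsoup4 schedule.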
def extract_new_elements(url):
    global previous_elements  # updated below so only genuinely new items are reported
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        current_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
        new_elements = current_elements - previous_elements
        if new_elements:
            print("New elements found:", new_elements)
        else:
            print("No new elements found.")
        previous_elements = current_elements  # remember this snapshot for the next check
    else:
        print("Failed to retrieve data from the URL.")
Importing Libraries: The function depends on the requests library to make HTTP requests and the BeautifulSoup class from the bs4 module to parse HTML content.
HTTP Request: It sends an HTTP GET request to the URL provided as the url parameter.
Response Handling: It checks whether the response status code is 200, which indicates that the request was successful and the webpage is available.
Parsing HTML: If the response is successful, the function uses BeautifulSoup to parse the HTML content of the webpage.
Finding Current Elements: It then finds a specific <a> tag with a class attribute value of 'vehicles-export'. This tag likely contains data about vehicles. The function extracts the value of the 'data-vehicles' attribute from this tag and splits it into a set of elements. These elements represent the current data about vehicles available on the webpage.
Finding New Elements: It calculates the set difference between the current_elements set and a set named previous_elements. This subtraction identifies any elements that are in current_elements but not in previous_elements, effectively finding new elements since the last time the function was called. After the comparison, previous_elements is updated to the current snapshot so that the next run only reports genuinely new items.
Printing Results: If new elements are found, it prints a message indicating their presence along with the new elements themselves. If no new elements are found, it prints a message saying so.
Error Handling: If the HTTP request fails (e.g., due to a connection issue or an invalid URL), it prints an error message.
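To see the detection logic in isolation, without hitting a live site, here is a minimal sketch that runs the same set-difference comparison against inline HTML strings; the markup and vehicle names are made up for demonstration:
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure the script expects
html_before = '<a class="vehicles-export" data-vehicles="car1,car2"></a>'
html_after = '<a class="vehicles-export" data-vehicles="car1,car2,car3"></a>'

def snapshot(html):
    soup = BeautifulSoup(html, 'html.parser')
    return set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))

previous_elements = snapshot(html_before)
current_elements = snapshot(html_after)
print(current_elements - previous_elements)  # {'car3'}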
def job():
    print("Checking for new elements...")
    extract_new_elements(url)
job() is a function that does one simple task: it wraps extract_new_elements(), so whenever job is executed, extract_new_elements() runs as well. This wrapper becomes useful later, when we hand it to the scheduler.
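As an aside, schedule's do() method also accepts arguments to forward to the job function, so the wrapper could in principle be replaced by a direct call; keeping job() is still handy for the "Checking..." log line:
# Equivalent scheduling without the wrapper (schedule forwards the argument)
schedule.every(15).seconds.do(extract_new_elements, url)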
previous_elements = set()
We are using a set data structure here to store all the elements currently on the URL. The point of using a set is that if an element appears again after it has already been seen, there is no need to report it a second time; a set does exactly this, since it stores no duplicate elements and makes comparing two snapshots straightforward.
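As a quick refresher on why sets fit this job: duplicates collapse automatically, and the difference operator yields exactly the newly added items.
seen = set(["car1", "car2", "car1"])  # duplicates collapse automatically
print(seen)                            # {'car1', 'car2'}
current = {"car1", "car2", "car3"}
print(current - seen)                  # {'car3'} -- only the new item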
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    previous_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
else:
    print("Failed to retrieve initial data from the URL.")
This code fetches data from the specified URL using requests.get(). If the HTTP response status code is 200 (indicating success), it parses the HTML content using BeautifulSoup. Specifically, it searches for an <a> tag with the class attribute set to 'vehicles-export' and extracts the value of the 'data-vehicles' attribute. This value is split into elements and stored in a set named previous_elements, which serves as the initial snapshot.
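One caveat: soup.find() returns None when no matching tag exists, so the attribute lookup above would raise a TypeError if the page layout changes. A slightly more defensive variant of the same snippet (same logic, just guarded) might look like this:
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    tag = soup.find('a', class_='vehicles-export')
    if tag is not None and tag.has_attr('data-vehicles'):
        previous_elements = set(tag['data-vehicles'].split(','))
    else:
        print("Expected 'vehicles-export' tag not found on the page.")
else:
    print("Failed to retrieve initial data from the URL.")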
schedule.every(15).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
This is the snippet that triggers the whole process: it schedules job to run every 15 seconds and keeps the script alive to run pending jobs. It will not stop executing until the user interrupts it manually, for example by pressing Ctrl + C in the terminal.
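Fifteen seconds is quite aggressive for polling a live website; the schedule library makes it easy to pick a gentler interval. For example:
schedule.every(10).minutes.do(job)        # check every 10 minutes
schedule.every().hour.do(job)             # check once an hour
schedule.every().day.at("09:00").do(job)  # check daily at 09:00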
The final script should look something like this:
import requests
from bs4 import BeautifulSoup
import schedule
import time

# Function to extract new elements from the URL
def extract_new_elements(url):
    global previous_elements  # updated below so only genuinely new items are reported
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        current_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
        new_elements = current_elements - previous_elements
        if new_elements:
            print("New elements found:", new_elements)
            # Save new elements in a variable or perform any other desired action
            # For example:
            # new_elements_variable = list(new_elements)
        else:
            print("No new elements found.")
        previous_elements = current_elements  # remember this snapshot for the next check
    else:
        print("Failed to retrieve data from the URL.")

# Function to be scheduled to run every 15 seconds
def job():
    print("Checking for new elements...")
    extract_new_elements(url)

# URL to scrape
url = "https://remarketing.jyskefinans.dk/cars/"

# Variable to store previously seen elements
previous_elements = set()

# Initial extraction of elements
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    previous_elements = set(soup.find('a', class_='vehicles-export')['data-vehicles'].split(','))
else:
    print("Failed to retrieve initial data from the URL.")

# Schedule job to run every 15 seconds
schedule.every(15).seconds.do(job)

# Run the scheduler
while True:
    schedule.run_pending()
    time.sleep(1)
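The script currently "notifies" by printing to the terminal. If you want a persistent record, one simple extension is to append each discovery to a log file. This is a minimal sketch using only the standard library; the notify helper and the file name are my own additions, not part of the original script:
from datetime import datetime

# Hypothetical helper: append newly found elements to a log file
def notify(new_elements, log_path="new_elements.log"):
    timestamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log_file:
        for element in sorted(new_elements):
            log_file.write(f"{timestamp}\t{element}\n")

# Inside extract_new_elements(), you could call notify(new_elements)
# right after printing them.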
The official documentation of the Beautiful Soup project (version 4, which this script uses) can be found here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/