How To Do A Sitemap Audit For Better Indexing & Crawling Via Python

Sitemap auditing involves syntax, crawlability, and indexation checks for the URLs and tags in your sitemap files.
A sitemap file contains the URLs to index along with further information about each URL: the last modification date, the priority, the images and videos on the URL, its alternates in other languages, and the change frequency.
Sitemap index files can involve millions of URLs, even though a single sitemap can contain at most 50,000 URLs.
Auditing these URLs for better indexation and crawling can take time.
But with the help of Python and SEO automation, it's possible to audit millions of URLs within the sitemaps.
What Do You Need To Perform A Sitemap Audit With Python?
To understand the Python sitemap audit process, you'll need:
A fundamental understanding of technical SEO and sitemap XML files.
Working knowledge of Python and sitemap XML syntax.
The ability to work with Python libraries, Pandas, Advertools, LXML, Requests, and XPath selectors.
Which URLs Should Be In The Sitemap?
A healthy sitemap XML file should meet the following criteria:
All URLs should have a 200 status code.
All URLs should be self-canonical.
URLs should be open to being indexed and crawled.
URLs shouldn't be duplicated.
URLs shouldn't be soft 404s.
The sitemap should have proper XML syntax.
The URLs in the sitemap should have a canonical that aligns with their Open Graph and Twitter Card URLs.
The sitemap should have no more than 50,000 URLs and be no larger than 50 MB.
What Are The Benefits Of A Healthy XML Sitemap File?
Smaller sitemaps are better than larger sitemaps for faster indexation. This is particularly important in News SEO, as smaller sitemaps help increase the overall count of valid indexed URLs.
Differentiate frequently updated and static content URLs from each other to provide a better crawling distribution among the URLs.
Using the “lastmod” date in an honest way that aligns with the actual publication or update date helps a search engine trust the date of the latest publication.
The sitemap audit with Python in this article follows the criteria above for better indexing, crawling, and search engine communication.
An Important Note…
When it comes to a sitemap's nature and audit, Google and Microsoft Bing don't use “changefreq” for the change frequency of the URLs or “priority” to understand the prominence of a URL. In fact, they call it a “bag of noise.”
However, Yandex and Baidu use all these tags to understand the website's characteristics.
A 16-Step Sitemap Audit For SEO With Python
A sitemap audit can involve content categorization, site-tree, or topicality and content characteristics.
However, a sitemap audit for better indexing and crawlability mainly involves technical SEO rather than content characteristics.
In this step-by-step sitemap audit process, we'll use Python to tackle the technical aspects of auditing sitemaps with millions of URLs.
Image created by the author, February 2022
1. Import The Python Libraries For Your Sitemap Audit
The following code block imports the Python libraries required for the sitemap XML file audit.
import advertools as adv
import pandas as pd
from lxml import etree
from IPython.display import display, HTML
Here's what you need to know about this code block:
Advertools is necessary for taking the URLs from the sitemap file and making requests to retrieve their content or response status codes.
“Pandas” is necessary for aggregating and manipulating the data.
Plotly is necessary for the visualization of the sitemap audit output.
LXML is necessary for the syntax audit of the sitemap XML file.
IPython is optional; it expands the output cells of Jupyter Notebook to 100% width.
2. Take All Of The URLs From The Sitemap
Millions of URLs can be loaded into a Pandas data frame with Advertools, as shown below.
sitemap_url = ""  # the sitemap (index) URL is omitted in the source
sitemap = adv.sitemap_to_df(sitemap_url)
sitemap.to_csv("sitemap.csv")  # save a copy so the download does not need to be repeated
sitemap_df = pd.read_csv("sitemap.csv", index_col=False)
sitemap_df.drop(columns=["Unnamed: 0"], inplace=True)
Above, the sitemap has been loaded into a Pandas data frame, and you can see the output below.

A general sitemap URL extraction with sitemap tags with Python is above.
In total, the sitemap index file contains 245,691 URLs.
The website uses “changefreq,” “lastmod,” and “priority” inconsistently.
3. Check Tag Usage Within The Sitemap XML File
To understand which tags are used or not within the sitemap XML file, use the function below.
def check_sitemap_tag_usage(sitemap):
    # In every count below, the True row is the number of URLs missing the tag.
    lastmod = sitemap["lastmod"].isna().value_counts()
    priority = sitemap["priority"].isna().value_counts()
    changefreq = sitemap["changefreq"].isna().value_counts()
    lastmod_perc = sitemap["lastmod"].isna().value_counts(normalize=True) * 100
    priority_perc = sitemap["priority"].isna().value_counts(normalize=True) * 100
    changefreq_perc = sitemap["changefreq"].isna().value_counts(normalize=True) * 100
    sitemap_tag_usage_df = pd.DataFrame(data={"lastmod": lastmod,
        "priority": priority,
        "changefreq": changefreq,
        "lastmod_perc": lastmod_perc,
        "priority_perc": priority_perc,
        "changefreq_perc": changefreq_perc})
    # fillna(0) keeps astype(int) from failing when a tag is never (or always) missing.
    return sitemap_tag_usage_df.fillna(0).astype(int)
The function check_sitemap_tag_usage is a data frame constructor based on the usage of the sitemap tags.
It takes the “lastmod,” “priority,” and “changefreq” columns, applies the “isna()” and “value_counts()” methods to them, and collects the results with “pd.DataFrame”.
Below, you can see the output.
Sitemap Audit with Python for sitemap tags' usage.
The data frame above shows that 96,840 of the URLs do not have the lastmod tag, which is equal to 39% of the total URL count of the sitemap file.
The same share is 19% for “priority” and “changefreq” within the sitemap XML file.
There are three main content freshness signals from a website.
These are the dates on a web page (visible to the user), in the structured data (invisible to the user), and in the “lastmod” of the sitemap.
If these dates are not consistent with each other, search engines can ignore the dates on the website when evaluating its freshness signals.
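As a small illustration of the tag usage check, the sketch below computes the share of URLs that are missing each tag on a toy data frame. The URLs, the values, and the helper name missing_tag_share are invented for the example; a real audit would pass the data frame returned by Advertools.

```python
import pandas as pd

# Toy stand-in for the sitemap data frame; the URLs and values are invented.
toy_sitemap = pd.DataFrame({
    "loc": ["https://example.com/a", "https://example.com/b",
            "https://example.com/c", "https://example.com/d"],
    "lastmod": ["2022-01-01", None, "2022-01-03", None],
    "priority": [0.8, 0.5, None, 0.5],
    "changefreq": ["daily", "daily", "weekly", "daily"],
})

def missing_tag_share(df, tags=("lastmod", "priority", "changefreq")):
    """Return the percentage of rows in which each sitemap tag is missing."""
    return {tag: float(df[tag].isna().mean() * 100) for tag in tags}

tag_report = missing_tag_share(toy_sitemap)
print(tag_report)
```

Using the mean of the “isna()” mask gives the percentage directly, which avoids the empty rows that “value_counts()” produces when a tag is never missing.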
4. Audit The Site-tree And URL Structure Of The Website
Understanding the most important or most crowded URL paths is necessary to weigh the website's SEO efforts or technical SEO audits.
A single improvement for technical SEO can benefit thousands of URLs simultaneously, which creates a cost-effective and budget-friendly SEO strategy.
Understanding the URL structure mainly focuses on the website's more prominent sections and on content network analysis.
To create a URL tree data frame from a website's sitemap URLs, use the following code block.
sitemap_url_df = adv.url_to_df(sitemap_df["loc"])
With the help of “urllib” or “advertools” as above, you can easily parse the URLs within the sitemap into a data frame.

Creating a URL tree with URLLib or Advertools is easy.

Checking the URL breakdowns helps to understand the overall information tree of a website.
The data frame above contains the “scheme,” “netloc,” “path,” and every “/” breakdown within the URLs as a “dir”, which represents the directory.
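To make the breakdown concrete, here is a minimal sketch of the same idea with the standard library only. The function name url_breakdown, the “dir_1”, “dir_2” key names, and the example URL are assumptions for illustration; they are not taken from the Advertools output.

```python
from urllib.parse import urlsplit

def url_breakdown(url):
    """Split a URL into scheme, netloc, path, and numbered directory parts."""
    parts = urlsplit(url)
    row = {"scheme": parts.scheme, "netloc": parts.netloc, "path": parts.path}
    # Every non-empty "/" segment of the path becomes dir_1, dir_2, ...
    segments = [segment for segment in parts.path.split("/") if segment]
    for position, segment in enumerate(segments, start=1):
        row[f"dir_{position}"] = segment
    return row

example_row = url_breakdown("https://example.com/reviews/banks/page-2")
print(example_row)
```

A list of such rows can be passed to “pd.DataFrame” to get a table with one column per directory level.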
Auditing the URL structure of the website is important for two goals.
These are checking whether all URLs use “HTTPS” and understanding the content network of the website.
Content analysis with sitemap files is not directly the topic of “Indexing and Crawling”, so we will only touch on it briefly at the end of the article.
Check the next section to see the SSL usage on sitemap URLs.
5. Check The HTTPS Usage On The URLs Within Sitemap
Use the following code block to check the HTTPS usage ratio for the URLs within the sitemap.
sitemap_url_df["scheme"].value_counts()
The code block above applies a simple aggregation to the “scheme” column, which contains the URLs' protocol information.
Using “value_counts”, we see that all URLs are on HTTPS.
Checking for HTTP URLs in the sitemaps can help find larger URL property consistency errors.
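The protocol check can also be sketched without a data frame. The example below counts schemes and lists the URLs that are not on HTTPS; the URLs and the helper names scheme_counts and non_https_urls are made up for the illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

def scheme_counts(urls):
    """Count how many URLs use each scheme (http, https, ...)."""
    return Counter(urlsplit(url).scheme for url in urls)

def non_https_urls(urls):
    """Return the URLs that are not served over HTTPS."""
    return [url for url in urls if urlsplit(url).scheme != "https"]

sample_urls = [
    "https://example.com/",
    "https://example.com/reviews",
    "http://example.com/old-page",
]
print(scheme_counts(sample_urls))
print(non_https_urls(sample_urls))
```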
6. Check The Robots.txt Disallow Commands For Crawlability
The structure of the URLs within the sitemap is useful for spotting “submitted but disallowed” situations.
To see whether the website has a robots.txt file, use the code block below.
import requests
r = requests.get("")  # the robots.txt URL is omitted in the source
r.status_code
Simply, we send a “GET request” to the robots.txt URL.
If the response status code is 200, it means there is a robots.txt file for user-agent-based crawling control.
After checking that “robots.txt” exists, we can use the “adv.robotstxt_test” method for a bulk robots.txt audit of the crawlability of the URLs in the sitemap.
sitemap_df_robotstxt_check = adv.robotstxt_test("", urls=sitemap_df["loc"], user_agents=["*"])  # robots.txt URL omitted in the source
sitemap_df_robotstxt_check["can_fetch"].value_counts()
We have created a new variable called “sitemap_df_robotstxt_check” and assigned the output of the “robotstxt_test” method to it.
We have used the URLs within the sitemap with sitemap_df["loc"].
We have performed the audit for all of the user-agents via the user_agents=["*"] parameter and value pair.
You can see the result below.
True 245690
False 1
Name: can_fetch, dtype: int64
It shows that there is one URL that is disallowed but submitted.
We can filter the specific URL as below.
sitemap_df_robotstxt_check[sitemap_df_robotstxt_check["can_fetch"] == False]
We have used “set_option” to expand all of the values within the “url_path” section.

A URL appears as disallowed but submitted via a sitemap, as in the Google Search Console Coverage Reports.

We see that a “profile” page has been disallowed and submitted.
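For a quick offline check of the same idea, the standard library's “urllib.robotparser” can test URLs against robots.txt rules. The sketch below is only an illustration: the rules and URLs are invented, and Python's parser does not necessarily match a search engine's robots.txt handling in every detail.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules for the example.
robots_txt = """
User-agent: *
Disallow: /profile/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

sitemap_urls = [
    "https://example.com/reviews/",
    "https://example.com/profile/123",
]

# URLs that are submitted in the sitemap but disallowed for all user agents.
submitted_but_disallowed = [
    url for url in sitemap_urls if not parser.can_fetch("*", url)
]
print(submitted_but_disallowed)
```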
Later, the same check can be done for further examinations such as “disallowed but internally linked”.
But, to do that, we would need to crawl at least 3 million URLs from the website, which could be an entirely new guide.
Some website URLs do not have a proper “directory hierarchy”, which can make the analysis of the URLs, in terms of content network characteristics, harder. The audited website doesn't use a proper URL structure and taxonomy, so analyzing the website structure is not easy for an SEO or a search engine.
But the most used words within the URLs or the content update frequency can signal which topic the company actually weighs on.
Since we focus on “technical aspects” in this tutorial, the Sitemap Content Audit is left to a separate guide.
7. Check The Status Code Of The Sitemap URLs With Python
Every URL within the sitemap has to have a 200 status code.
A crawl has to be performed to check the status codes of the URLs within the sitemap.
But, since it's costly when you have millions of URLs to audit, we can simply use a new crawling method from Advertools.
Without taking the response body, we can crawl just the response headers of the URLs within the sitemap.
It is useful for decreasing the crawl time when auditing possible robots, indexing, and canonical signals from the response headers.
To perform a response header crawl, use the “adv.crawl_headers” method.
adv.crawl_headers(sitemap_df["loc"], output_file="sitemap_df_header.jl")
df_headers = pd.read_json("sitemap_df_header.jl", lines=True)
df_headers["status"].value_counts()
The output of the code block for checking the status codes of the URLs within the sitemap XML files can be seen below.
200 207866
404 23
Name: status, dtype: int64
It shows that 23 URLs from the sitemap actually return a 404 status code.
They should be removed from the sitemap.
To audit which URLs from the sitemap are 404, use the filtration method below from Pandas.
df_headers[df_headers["status"] == 404]
The result can be seen below.
Finding the 404 URLs in sitemaps is helpful against link rot.
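The filtering step can be tried on a toy data frame before running a real header crawl. In the sketch below, the URLs and status codes are invented, and only the “url” and “status” column names follow the crawl output used above.

```python
import pandas as pd

# Toy stand-in for the header crawl output.
toy_headers = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b",
            "https://example.com/c", "https://example.com/d"],
    "status": [200, 404, 200, 301],
})

# How many URLs returned each status code.
status_counts = toy_headers["status"].value_counts().to_dict()

# Every URL that did not answer with 200 is a candidate for removal.
urls_to_remove = toy_headers.loc[toy_headers["status"] != 200, "url"].tolist()
print(status_counts)
print(urls_to_remove)
```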
8. Check The Canonicalization From Response Headers
From time to time, using canonicalization hints in the response headers is beneficial for consolidating crawling and indexing signals.
In this context, the canonical tag in the HTML and the one in the response header have to be the same.
If there are two different canonicalization signals on a web page, the search engines can ignore both assignments.
For the audited website, we don't have a canonical response header.
The first step is auditing whether the response header for canonical usage exists.
The second step is comparing the response header canonical value to the HTML canonical value if it exists.
The third step is checking whether the canonical values are self-referential.
Check the columns of the output of the header crawl to check the canonicalization from response headers.
Below, you can see the columns.
Python SEO Crawl Output Data Frame columns. The “dataframe.columns” method is always useful to check.
If you are not familiar with the response headers, you may not know how to use canonical hints within response headers.
A response header can include the canonical hint with the “Link” value.
It is registered as “resp_headers_link” by Advertools directly.
Another problem is that the extracted strings appear within the “<URL>;” string pattern.
It means we will use regex to extract it.
You can see the result below.
Screenshot from Pandas, February 2022
The regex pattern “[^<>][a-z:/0-9-.]*” is good enough to extract the specific canonical value.
A self-canonicalization check with the response headers is below.
df_headers["response_header_canonical"] = df_headers["resp_headers_link"].str.extract(r"([^<>][a-z:/0-9-.]*)")
(df_headers["response_header_canonical"] == df_headers["url"]).value_counts()
We have used two different boolean checks.
One to check whether the response header canonical hint is equal to the URL itself.
Another to see whether the status code is 200.
Since we have 404 URLs within the sitemap, their canonical value will be “NaN”.
It shows there are specific URLs with canonicalization inconsistencies.
We have 29 outliers for technical SEO. Every wrong signal given to the search engine for indexation or ranking will cause the dilution of the ranking signals.
To see these URLs, use the code block below.
df_headers[(df_headers["response_header_canonical"] != df_headers["url"]) & (df_headers["status"] == 200)]
Screenshot from Pandas, February 2022.
The canonical values from the response headers can be seen above.
Even a single “/” in the URL can cause a canonicalization conflict, as appears here for the homepage. Screenshot for checking the response header canonical value and the actual URL of the web page.

If you check log files, you will see that the search engine crawls the URLs from the “Link” response headers.
Thus, in technical SEO, this should be given weight.
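As a sketch of how a canonical hint can be read from a “Link” header value, the example below uses a slightly stricter regular expression than the pattern in the text. The pattern, the helper names, and the example header values are assumptions for illustration, and the pattern only handles the simple case where rel="canonical" directly follows the URL.

```python
import re

# Matches the URL of an entry such as: <https://example.com/>; rel="canonical"
CANONICAL_LINK = re.compile(
    r'<([^<>]+)>\s*;\s*rel\s*=\s*"?canonical"?', re.IGNORECASE
)

def canonical_from_link_header(link_header):
    """Return the canonical URL from a Link header value, or None."""
    if not link_header:
        return None
    match = CANONICAL_LINK.search(link_header)
    return match.group(1) if match else None

def is_self_canonical(url, link_header):
    """True when the header canonical is exactly the requested URL."""
    return canonical_from_link_header(link_header) == url

header = '<https://example.com/>; rel="canonical"'
print(canonical_from_link_header(header))
print(is_self_canonical("https://example.com", header))
```

The last call compares a URL without the trailing “/” to a canonical that has one, which is the kind of single-character conflict described for the homepage above.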
9. Check The Indexing And Crawling Commands From Response Headers
There are 14 different X-Robots-Tag specifications for the Google search engine crawler.
The latest one is “indexifembedded”, which determines the indexation amount on a web page.
The indexing and crawling directives can be in the form of a response header or an HTML meta tag.
This section focuses on the response header version of the indexing and crawling directives.
The first step is checking whether the X-Robots-Tag property and values exist within the HTTP header or not.
The second step is auditing whether they align with the HTML meta tag properties and values, if those exist.
Use the function below to check the “X-Robots-Tag” from the response headers.
def robots_tag_checker(dataframe: pd.DataFrame):
    # Return the first column whose name mentions "robots", if there is one.
    for i in dataframe:
        if "robots" in i:
            return i
    return "There is no robots tag"
robots_tag_checker(df_headers)
'There is no robots tag'
We have created a custom function to check the “X-Robots-Tag” response headers of the web pages.
It appears that our test subject website doesn't use the X-Robots-Tag.
If there were an X-Robots-Tag, the code block below should be used.
df_headers[df_headers["response_header_x_robots_tag"] == "noindex"]
Check whether there is a “noindex” directive in the response headers, and filter the URLs with this indexation conflict.
In the Google Search Console Coverage Report, these appear as “Submitted marked as noindex”.
Contradicting indexing and canonicalization hints and signals might make a search engine ignore all of the signals while making the search algorithms trust the user-declared signals less.
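An X-Robots-Tag value can hold several comma-separated directives and an optional user-agent prefix, so an exact comparison with “noindex” can miss cases. The sketch below splits a header value into directives first; the helper names and example values are assumptions, and the parsing is deliberately simplified.

```python
def parse_x_robots_tag(header_value):
    """Return the lowercase directives found in an X-Robots-Tag header value."""
    if not header_value:
        return set()
    directives = set()
    for part in header_value.split(","):
        part = part.strip().lower()
        # Drop an optional user-agent prefix such as "googlebot: noindex",
        # but keep directives that carry their own value after a colon.
        if ":" in part and not part.startswith(("max-", "unavailable_after")):
            part = part.split(":", 1)[1].strip()
        if part:
            directives.add(part)
    return directives

def blocks_indexing(header_value):
    """True when the header asks search engines not to index the page."""
    return bool(parse_x_robots_tag(header_value) & {"noindex", "none"})

print(parse_x_robots_tag("noindex, nofollow"))
print(blocks_indexing("googlebot: noindex"))
```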
10. Check The Self Canonicalization Of Sitemap URLs
Every URL in the sitemap XML files should give a self-canonicalization hint.
Sitemaps should only include the canonical versions of the URLs.
The Python code block in this section checks whether the sitemap URLs have self-canonicalization values or not.
To check the canonicalization from the “<head>” section of the HTML documents, crawl the websites by taking their response body.
Use the code block below.
user_agent = "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +"  # the URL after "+" is truncated in the source
adv.crawl(sitemap_df["loc"],
    output_file="sitemap_crawl_complaintsboard.jl",  # assumed name mirroring the log file; the original lines are lost
    follow_links=False,  # assumed; only the sitemap URLs are crawled
    custom_settings={"LOG_FILE": "sitemap_crawl_complaintsboard.log", "USER_AGENT": user_agent})
The difference between “crawl_headers” and “crawl” is that “crawl” takes the entire response body while “crawl_headers” takes only the response headers.
You can check the file size differences between the crawl logs, the response header crawl, and the full response body crawl.
Python Crawl Output Size Comparison.
Going from a 6 GB output to a 387 MB output is quite economical.
If a search engine just wants to see certain response headers and the status code, putting the information in the headers would make its crawl hits more economical.
How To Deal With Large DataFrames For Reading And Aggregating Data?
This section requires dealing with large data frames.
A computer can't read a Pandas DataFrame from a CSV or JL file if the file size is larger than the computer's RAM.
Thus, the “chunking” method is used.
When a website's sitemap XML file contains millions of URLs, the total crawl output can be larger than tens of gigabytes.
An iteration across the rows of the sitemap crawl output data frame is necessary.
For chunking, use the code block below.
df_iterator = pd.read_json(
    "sitemap_crawl_complaintsboard.jl",  # assumed file name; the original argument lines are lost
    chunksize=10000,  # assumed chunk size
    lines=True)
for i, df_chunk in enumerate(df_iterator):
    output_df = pd.DataFrame(data={"url": df_chunk["url"], "canonical": df_chunk["canonical"], "self_canonicalised": df_chunk["url"] == df_chunk["canonical"]})
    mode = "w" if i == 0 else "a"
    header = i == 0
    # "canonical_check.csv" is an assumed output name; the original line is lost.
    output_df.to_csv("canonical_check.csv", index=False, header=header, mode=mode)
df = pd.read_csv("canonical_check.csv")
df[(df["url"] != df["canonical"]) & (df["self_canonicalised"] == False) & (df["canonical"].isna() != True)]
You can see the result below.
Python SEO Canonicalization Audit.
We see that the paginated URLs from the “book” subfolder give canonical hints to the first page, which is an incorrect practice according to Google's guidelines.
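Chunked reading can be tried on a few in-memory rows before pointing it at a multi-gigabyte crawl file. The sketch below is a toy version under stated assumptions: the rows are invented, the chunk size of 2 is chosen only to force several chunks, and the results are collected in a list rather than written to a CSV file.

```python
import io
import pandas as pd

# Three fake crawl rows in JSON Lines format; a real crawl file is far larger.
jsonlines = io.StringIO(
    '{"url": "https://example.com/a", "canonical": "https://example.com/a"}\n'
    '{"url": "https://example.com/b?page=2", "canonical": "https://example.com/b"}\n'
    '{"url": "https://example.com/c", "canonical": "https://example.com/c"}\n'
)

non_self_canonical = []
# With chunksize set, read_json yields data frames of at most two rows each,
# so only one small chunk is held in memory at a time.
for chunk in pd.read_json(jsonlines, lines=True, chunksize=2):
    mismatched = chunk[chunk["url"] != chunk["canonical"]]
    non_self_canonical.extend(mismatched["url"].tolist())

print(non_self_canonical)
```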
11. Check The Sitemap Sizes Within Sitemap Index Files
Every sitemap file should be smaller than 50 MB. Use the Python code block below, in the Technical SEO with Python context, to check the sitemap file sizes.
# "sitemap_size_mb" is one of the columns that adv.sitemap_to_df adds for every sitemap.
pd.pivot_table(sitemap_df, index="sitemap", values="sitemap_size_mb", aggfunc="max")
You can see the result below.
Python SEO Sitemap Size Audit.
We see that all sitemap XML files are under 50 MB.
For better and faster indexation, keeping the sitemap URLs valuable and unique while decreasing the size of the sitemap files is beneficial.
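When a sitemap file is already on disk or in memory, both limits can be checked directly. The sketch below is a minimal example with the standard library; the function name sitemap_limits and the two-URL sample are invented, and the size is measured on the uncompressed bytes.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, uncompressed
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_limits(xml_bytes):
    """Report the URL count and byte size of one sitemap file."""
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall("sm:url", NS))
    size_bytes = len(xml_bytes)
    return {
        "url_count": url_count,
        "size_bytes": size_bytes,
        "within_limits": url_count <= MAX_URLS and size_bytes <= MAX_BYTES,
    }

sample_sitemap = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<url><loc>https://example.com/a</loc></url>'
    b'<url><loc>https://example.com/b</loc></url>'
    b'</urlset>'
)
limits_report = sitemap_limits(sample_sitemap)
print(limits_report)
```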
12. Check The URL Count Per Sitemap With Python
Every sitemap should have no more than 50,000 URLs.
Use the Python code block below to check the URL counts within the sitemap XML files.
(pd.pivot_table(sitemap_df,
    values=["loc"],
    index="sitemap",
    aggfunc="count")
.sort_values(by="loc", ascending=False))
You can see the result below.
Python SEO Sitemap URL Count Audit.

All sitemaps have fewer than 50,000 URLs. Some sitemaps have only one URL, which wastes the search engine's attention.
Keeping the sitemap URLs that are frequently updated separate from the static and stale content URLs is beneficial.
Differences in URL count and URL content character help a search engine adjust crawl demand effectively for different website sections.
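The per-sitemap count can also be expressed with “groupby”, which makes it easy to list the sitemaps that hold a single URL. The data frame below is a toy example with invented sitemap names and URLs.

```python
import pandas as pd

toy_sitemap_df = pd.DataFrame({
    "loc": ["https://example.com/a", "https://example.com/b",
            "https://example.com/c", "https://example.com/d",
            "https://example.com/e"],
    "sitemap": ["a.xml", "a.xml", "a.xml", "b.xml", "c.xml"],
})

# Number of URLs submitted by each sitemap file, largest first.
url_counts = (toy_sitemap_df.groupby("sitemap")["loc"]
              .count()
              .sort_values(ascending=False))

# Sitemaps that carry only one URL.
single_url_sitemaps = sorted(url_counts[url_counts == 1].index.tolist())
print(url_counts)
print(single_url_sitemaps)
```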
13. Check The Indexing And Crawling Meta Tags From URLs’ Content With Python
Even if a web page is not disallowed by robots.txt, it can still be blocked by the HTML meta tags.
Thus, checking the HTML meta tags for better indexation and crawling is necessary.
Using “custom selectors” is necessary to perform the HTML meta tag audit for the sitemap URLs.
sitemap = adv.sitemap_to_df("")  # the sitemap URL is omitted in the source
# follow_links and custom_settings are reconstructed from the text below; the source lost these lines.
adv.crawl(url_list=sitemap["loc"][:1000], output_file="meta_command_audit.jl",
    follow_links=False,
    xpath_selectors={"meta_command": "//meta[@name='robots']/@content"},
    custom_settings={"CLOSESPIDER_PAGECOUNT": 1000})
df_meta_check = pd.read_json("meta_command_audit.jl", lines=True)
df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True).value_counts()
The “//meta[@name='robots']/@content” XPath selector extracts all the robots commands from the URLs in the sitemap.
We have used only the first 1000 URLs in the sitemap.
And I stop crawling after the initial 1000 responses.
I have used another website to check the crawling meta tags since the audited website doesn't have them in the source code.
You can see the result below.
Python SEO Meta Robots Audit.

None of the URLs from the sitemap have “nofollow” or “noindex” within the “Robots” commands.
To check their values, use the code below.
df_meta_check[df_meta_check["meta_command"].str.contains("nofollow|noindex", regex=True) == False][["url", "meta_command"]]
You can see the result below.
Meta Tag Audit from the Websites.
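For a single HTML document, the same meta robots extraction can be sketched with the standard library instead of an XPath selector. The class name MetaRobotsParser, the helper meta_robots, and the HTML snippet are assumptions made for the example.

```python
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.extend(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def meta_robots(html):
    """Return the list of meta robots directives found in an HTML string."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(meta_robots(page))
```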
14. Validate The Sitemap XML File Syntax With Python
Sitemap XML file syntax validation is necessary to confirm that the search engine can interpret the sitemap file as intended.
Even if there are certain syntax errors, a search engine can recognize the sitemap file through XML normalization.
But every syntax error can decrease the efficiency to some degree.
Use the code block below to validate the sitemap XML file syntax.
def validate_sitemap_syntax(xml_path: str, xsd_path: str):
    xmlschema_doc = etree.parse(xsd_path)
    xmlschema = etree.XMLSchema(xmlschema_doc)
    xml_doc = etree.parse(xml_path)
    # validate() returns True when the XML document conforms to the XSD schema.
    result = xmlschema.validate(xml_doc)
    return result
validate_sitemap_syntax("sej_sitemap.xml", "sitemap.xsd")
For this example, I have used “sej_sitemap.xml” together with the “sitemap.xsd” schema file. The XSD file describes the XML file's context and tree structure.
It is stated in the first line of the sitemap file.
For further information, you can also check the DTD documentation.
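Before a full XSD validation, a lighter pre-check can catch files that are not well-formed or that use an unexpected root element. The sketch below uses only the standard library and is not a replacement for schema validation; the function name basic_sitemap_check and the sample documents are invented.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def basic_sitemap_check(xml_bytes):
    """Return (ok, message) for a well-formedness and root element check."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as error:
        return False, f"not well-formed: {error}"
    allowed_roots = (f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex")
    if root.tag not in allowed_roots:
        return False, f"unexpected root element: {root.tag}"
    return True, "ok"

good_sitemap = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<url><loc>https://example.com/</loc></url></urlset>'
)
broken_sitemap = b'<urlset><url><loc>https://example.com/</loc></url>'
print(basic_sitemap_check(good_sitemap))
print(basic_sitemap_check(broken_sitemap)[0])
```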
15. Check The Open Graph URL And Canonical URL Matching
It is not a secret that search engines also use the Open Graph and RSS Feed URLs from the source code for further canonicalization and exploration.
The Open Graph URLs should be the same as the canonical URL submission.
From time to time, even in Google Discover, Google chooses to use the image from the Open Graph.
To check the Open Graph URL and canonical URL consistency, use the code block below.
for i, df_chunk in enumerate(df_iterator):
    if "og:url" in df_chunk.columns:
        # The "canonical" and "og:url" entries are reconstructed; the source lost these lines.
        output_df = pd.DataFrame(data={
            "canonical": df_chunk["canonical"],
            "og:url": df_chunk["og:url"],
            "open_graph_canonical_consistency": df_chunk["canonical"] == df_chunk["og:url"]})
        mode = "w" if i == 0 else "a"
        header = i == 0
        output_df.to_csv("df_og_url_canonical_audit.csv", index=False, header=header, mode=mode)
    else:
        print("There is no Open Graph URL Property")
There is no Open Graph URL Property
If there is an Open Graph URL property on the website, the code will give a CSV file to check whether the canonical URL and the Open Graph URL are the same or not.
But for this website, we don't have an Open Graph URL.
Thus, I have used another website for the audit.
if "og:url" in df_meta_check.columns:
    # The "canonical" and "og:url" entries are reconstructed; the source lost these lines.
    output_df = pd.DataFrame(data={
        "canonical": df_meta_check["canonical"],
        "og:url": df_meta_check["og:url"],
        "open_graph_canonical_consistency": df_meta_check["canonical"] == df_meta_check["og:url"]})
    # No chunking is needed here, so the file is written once with its header.
    output_df.to_csv("df_og_url_canonical_audit.csv", index=False)
else:
    print("There is no Open Graph URL Property")
df = pd.read_csv("df_og_url_canonical_audit.csv")
You can see the result below.
Python SEO Open Graph URL Audit.
We see that all canonical URLs and the Open Graph URLs are the same.
Python SEO Canonicalization Audit.
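The consistency check itself is a single column comparison, which the toy example below demonstrates. The rows are invented; only the “canonical” and “og:url” column names follow the crawl output used above.

```python
import pandas as pd

toy_crawl = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "canonical": ["https://example.com/a", "https://example.com/b"],
    "og:url": ["https://example.com/a", "https://example.com/b/"],
})

if "og:url" in toy_crawl.columns:
    # True where the Open Graph URL repeats the canonical URL exactly.
    toy_crawl["og_matches_canonical"] = toy_crawl["canonical"] == toy_crawl["og:url"]
    og_mismatches = toy_crawl.loc[~toy_crawl["og_matches_canonical"], "url"].tolist()
else:
    og_mismatches = []

print(og_mismatches)
```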
16. Check The Duplicate URLs Within Sitemap Submissions
A sitemap index file shouldn't have duplicated URLs across different sitemap files or within the same sitemap XML file.
The duplication of URLs within the sitemap files can make a search engine download the sitemap files less often, since a certain percentage of each sitemap file is bloated with unnecessary submissions.
In certain situations, it can appear as a spamming attempt to control the crawling schemes of the search engine crawlers.
Use the code block below to check the duplicate URLs within the sitemap submissions.
sitemap_df["loc"].duplicated().value_counts()
You can see that 49,574 URLs from the sitemap are duplicated.
Python SEO Duplicated URL Audit from the Sitemap XML Files
To see which sitemaps have more duplicated URLs, use the code block below.
pd.pivot_table(sitemap_df[sitemap_df["loc"].duplicated()==True], index="sitemap", values="loc", aggfunc="count").sort_values(by="loc", ascending=False)
You can see the result.
Python SEO Sitemap Audit for duplicated URLs.
Chunking the sitemaps can help with site-tree and technical SEO analysis.
To see the duplicated URLs within the sitemap, use the code block below.
sitemap_df[sitemap_df["loc"].duplicated() == True]
You can see the result below.
Duplicated Sitemap URL Audit Output.
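The behavior of “duplicated()” is easy to verify on a handful of rows. In the toy example below, the URLs and sitemap names are invented; note that the first occurrence of a URL is not flagged, only its repeats.

```python
import pandas as pd

toy_sitemap_df = pd.DataFrame({
    "loc": ["https://example.com/a", "https://example.com/b",
            "https://example.com/a", "https://example.com/c",
            "https://example.com/b", "https://example.com/a"],
    "sitemap": ["s1.xml", "s1.xml", "s2.xml", "s2.xml", "s2.xml", "s3.xml"],
})

# True for every repeat of a URL that already appeared in an earlier row.
duplicate_mask = toy_sitemap_df["loc"].duplicated(keep="first")
duplicate_count = int(duplicate_mask.sum())

# How many repeated URLs each sitemap file contributes.
duplicates_per_sitemap = (toy_sitemap_df[duplicate_mask]
                          .groupby("sitemap")["loc"]
                          .count()
                          .to_dict())
print(duplicate_count)
print(duplicates_per_sitemap)
```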
Conclusion
I wanted to show how to validate a sitemap file for better and healthier indexation and crawling for technical SEO.
Python is widely used for data science, machine learning, and natural language processing.
But you can also use it for technical SEO audits to support the other SEO verticals with a holistic SEO approach.
In a future article, we can expand these technical SEO audits further with different details and methods.
But, in general, this is one of the most comprehensive technical SEO guides for sitemaps and a sitemap audit tutorial with Python.
Featured Image: elenasavchina2/Shutterstock
