The Ultimate XPath Cheat Sheet. How to Easily Write Powerful Selectors.

Mihai Maxim on Dec 16, 2022

An XPath cheat sheet?

Have you ever needed to write a CSS selector that does not depend on class names? If your answer is no, well, you can consider yourself lucky. If the answer is yes, then our XPath cheat sheet is what you need. The web is full of data. Entire businesses depend on stitching some of it together to bring new services to the world. APIs are very useful, but not every website has an open API. Sometimes you have to get what you need the old-fashioned way: by building a scraper for the website. Modern websites hinder scraping by renaming their CSS classes. As a result, it is better to write selectors that rely on something more stable. In this article, you will learn how to write selectors based on the layout of the page's DOM nodes.

What is XPath and how can I try it?

XPath stands for XML Path Language. It uses a path notation (as in URLs) to provide a flexible way of pointing to any part of an XML document.

XPath is mainly used in XSLT, but it can also be used as a much more powerful way of navigating through the DOM of any XML-like language document using XPathExpression, such as HTML and SVG, instead of relying on the Document.getElementById() or Document.querySelectorAll() methods, the Node.childNodes properties, and other DOM Core features. XPath | MDN (mozilla.org)

A path notation?

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>

There are two types of paths: relative and absolute.

The unique path (or absolute path) to My third paragraph. is /html/body/div/div/p

A relative path to My third paragraph. is //body/div/div/p
For My Second Heading => //body/div/h2
For My first paragraph. => //body/p

Notice that I am using //body. Relative paths use // to jump straight to the desired element.

The usage of //<path> also implies that it should look for all occurrences of <path> in the document, regardless of what came before <path>.

For example, //div/p returns both My second paragraph. and My third paragraph.
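As a rough sketch, the same lookups can be reproduced outside the browser with Python's standard-library xml.etree.ElementTree module. This is an illustration under a few assumptions, not part of the original walkthrough: ElementTree implements only a subset of XPath and resolves paths relative to an element, so /html/body/div/div/p is written as ./body/div/div/p (starting from the <html> root) and //div/p as .//div/p. The DOCTYPE line is left out so the page parses as plain XML, and texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

PAGE = """<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [element.text for element in root.findall(path)]


# Same steps as /html/body/div/div/p, written relative to <html>.
unique_match = texts("./body/div/div/p")
# Same idea as //div/p: every <p> that is a direct child of any <div>.
anywhere_match = texts(".//div/p")

print(unique_match)    # ['My third paragraph.']
print(anywhere_match)  # ['My second paragraph.', 'My third paragraph.']
```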

You can try this example in your browser to get a better picture.

Paste the code into an .html file and open it in your browser. Open the developer tools and press Ctrl + F. Paste the XPath locator into the small input bar and press Enter.

You can also get the XPath of any tag by right-clicking it in the Elements tab and selecting "Copy XPath".

Notice how I switch between "My second paragraph." and "My third paragraph.".

Also, another important thing to know is that it is not necessary for a path to contain // in order to return multiple elements. Let's see what happens when I add another <p> in the last <div>.

/html/body/div/div/p is no longer a unique path: it now matches both <p> tags in the last <div>.
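A minimal sketch of that effect, using Python's xml.etree.ElementTree. It understands only a subset of XPath and needs the path written relative to the <html> root, hence ./body/div/div/p. The trimmed-down markup and the text of the extra paragraph are placeholders chosen for the illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><body><div><div>"
    "<p>My third paragraph.</p>"
    "</div></div></body></html>"
)

path = "./body/div/div/p"  # relative form of /html/body/div/div/p
matches_before = len(root.findall(path))

# Add a second <p> to the innermost <div>; the path itself is unchanged.
inner_div = root.find("./body/div/div")
extra = ET.SubElement(inner_div, "p")
extra.text = "A placeholder extra paragraph."
matches_after = len(root.findall(path))

print(matches_before, matches_after)  # 1 2
```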

If you have followed me this far, congratulations, you are well on your way to mastering XPath. You are now ready to dive into the fun stuff.

Brackets

You can use brackets to select specific elements.

In this case, //body/div/div[2]/p[3] only selects the last <p> tag. Positions inside brackets start at 1, not 0.
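Positional brackets can be tried with Python's xml.etree.ElementTree, which supports integer positions and last() as part of its limited XPath subset. The markup below is a hypothetical stand-in for the page in the screenshot, and the paths are written relative to the parsed <body> element.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one <div> with three <p> children.
root = ET.fromstring(
    "<body><div>"
    "<p>I am the first child</p>"
    "<p>I am the second child</p>"
    "<p>I am the third child</p>"
    "</div></body>"
)

first = root.find("./div/p[1]").text         # positions start at 1
third = root.find("./div/p[3]").text
final = root.find("./div/p[last()]").text    # last() is the last position

print(first)  # I am the first child
print(third)  # I am the third child
print(final)  # I am the third child
```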

Attributes

You can also use attributes to select your elements.

//body//p[@class="not-important"] => select all the <p> tags that are inside a <body> tag and have the "not-important" class.

//div[@id] => select all the <div> tags that have an id attribute.

//div[@class="p-children"][@id="important"]/p[3] => select the third <p> that is within a <div> tag that has both class="p-children" and id="important"

//div[@class="p-children" and @id="important"]/p[3] => same as above

//div[@class="p-children" or @id="important"]/p[3] => select the third <p> that is within a <div> that has class="p-children" or id="important"

Notice that @ marks the start of an attribute.
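The attribute selectors can be sketched with Python's xml.etree.ElementTree. Its XPath subset covers [@attr], [@attr='value'] and chained predicates, so the example sticks to those forms and leaves out the and / or variants. The markup and its text values are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup reusing the class and id values from the selectors.
root = ET.fromstring(
    '<body>'
    '<div id="plain"><p class="not-important">skip me</p></div>'
    '<div class="p-children" id="important">'
    '<p>one</p><p>two</p><p>three</p>'
    '</div>'
    '<div><p>no attributes here</p></div>'
    '</body>'
)

# //body//p[@class="not-important"]
not_important = [p.text for p in root.findall(".//p[@class='not-important']")]

# //div[@id]
divs_with_id = [d.get("id") for d in root.findall(".//div[@id]")]

# //div[@class="p-children"][@id="important"]/p[3]
third = [
    p.text
    for p in root.findall(".//div[@class='p-children'][@id='important']/p[3]")
]

print(not_important)  # ['skip me']
print(divs_with_id)   # ['plain', 'important']
print(third)          # ['three']
```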

Functions

XPath provides a set of useful functions that you can use inside the brackets.

position() => returns the index of the element
Ex: //body/div[position()=1] selects the first <div> in the <body>

last() => returns the index of the last element, i.e. the number of elements in the current selection
Ex: //div/p[last()] selects all the last <p> children of all the <div> tags

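count(nodes) => returns the number of nodes
Ex: count(//body/div) returns the number of child <div> tags inside the <body> (the form //body/count(div) needs XPath 2.0 or later, which browsers do not implement)

node() or * => node() matches any node (elements, text, comments), * matches any element
Ex: //div/node() selects all the child nodes of all the <div> tags, //div/* selects only their child elements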

text() => returns the text of the element
Ex: //p/text() returns the text of all the <p> elements

concat(string1, string2) => joins string1 and string2

contains(@attribute, "value") => returns true if @attribute contains "value"
Ex:
//p[contains(text(),"I am the third child")] selects all the <p> tags whose text contains "I am the third child".

starts-with(@attribute, "value") => returns true if @attribute starts with "value"
ends-with(@attribute, "value") => returns true if @attribute ends with "value" (XPath 2.0 and later; not available in browsers, which implement XPath 1.0)

substring(@attribute, start, length) => returns the part of the attribute value that begins at position start (counting from 1) and is length characters long
Ex:
//p[substring(text(),3,12)="am the third"] => true if text() = "I am the third child"

normalize-space() => acts like text(), but strips leading and trailing whitespace and collapses inner runs of whitespace into a single space
Ex: normalize-space(" example ") = "example"

string-length() => returns the length of the text
Ex: //p[string-length()=20] returns all the <p> tags that have the text length of 20
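The argument conventions of the string functions are easy to get wrong, so here is a small Python emulation of three of them. The helper names xpath_substring and xpath_normalize_space are hypothetical, the substring helper only handles integer arguments, and str.split() treats a slightly wider set of characters as whitespace than XPath does, so treat this as an approximation rather than a reference implementation.

```python
def xpath_substring(value: str, start: int, length: int) -> str:
    """Mimic XPath 1.0 substring() for integer arguments (1-based start)."""
    begin = max(start, 1) - 1          # XPath counts characters from 1
    end = max(start + length - 1, 0)   # exclusive end in 0-based terms
    return value[begin:end]


def xpath_normalize_space(value: str) -> str:
    """Mimic normalize-space(): trim both ends, collapse inner whitespace."""
    return " ".join(value.split())


text = "I am the third child"

print(xpath_substring(text, 3, 12))          # am the third
print(xpath_normalize_space("  example  "))  # example
print(len(text))                             # 20, as string-length() reports
```

The third argument of substring() is a length, which is why substring(text(),3,12) yields twelve characters rather than the characters between positions 3 and 12.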

Functions can be a bit hard to remember. Luckily, The Ultimate XPath Cheat Sheet provides helpful examples:

//p[text()=concat(substring(//p[@class="not-important"]/text(),1,15), substring(text(),16,20))]

//p[text()=<expression_return_value>] will select all the <p> elements that have the text value equal to the return value of the condition.

//p[@class="not-important"]/text() returns the text values of all the <p> tags that have class="not-important".

If there is only one <p> tag that satisfies this condition, then we can pass the return_value to the substring function.

substring(return_value,1,15) returns the first 15 characters of the return_value string.

substring(text(),16,20) returns up to 20 characters starting at position 16 (the third argument is a length, not an end index) of the same text() value that we used in //p[text()=<expression_return_value>].

Finally, concat() will merge the two substrings and create the return value of <expression_return_value>.

Path nesting

XPath supports path nesting. That's nice, but what exactly do I mean by path nesting?

Let's try something new: /html/body/div[./div[./p]]

You can read it as "Select all the <div> children of the <body> that have a <div> child, where that child is itself the parent of a <p> element."

If you don't care about the parent of the <p> element, you can write: /html/body/div[.//p]

This now translates to "Select all the div children of the body that have a <p> descendant"

In this particular example, /html/body/div[./div[./p]] and /html/body/div[.//p] give the same result.

By now, you are surely wondering what those dots in ./ and .// are about.

The dot represents the current (self) element. Inside a pair of brackets, it refers to the specific tag that opened them. Let's dig a little deeper.

In our example, /html/body/div returns two divs:
<div class="no-content"> and <div class="content">

/html/body/div[.//p] translates to:

   /html/body/div[1][/html/body/div[1]//p]
and /html/body/div[2][/html/body/div[2]//p].

/html/body/div[2][/html/body/div[2]//p] is true, so it returns /html/body/div[2]

In our case, the dot ensures that /html/body/div and /html/body/div//p refer to the same <div>

Now let's see what would have happened without the dot.

/html/body/div[/html/body/div//p] would return both 
<div class="no-content">  and <div class="content">

Why? Because /html/body/div//p is true for both /html/body/div[1] and /html/body/div[2].

/html/body/div[/html/body/div//p] actually translates to "Select all the div children of the <body> if /html/body/div//p is true".

/html/body/div//p is true if the body has a <div> child, and that child has a <p> descendant. In our case, this statement is always true.
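The difference between the two predicates can be spelled out in Python. ElementTree does not support nested path predicates such as [.//p], so this sketch performs the test by hand: once relative to each candidate <div> (the dot version) and once against the whole document (the version without the dot). The markup is a hypothetical reduction of the page, keeping only the two <div> children named above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the two <div> children of <body>.
root = ET.fromstring(
    '<html><body>'
    '<div class="no-content"><span>nothing here</span></div>'
    '<div class="content"><div><p>some text</p></div></div>'
    '</body></html>'
)

candidates = root.findall("./body/div")  # /html/body/div

# /html/body/div[.//p]: the test runs relative to each candidate <div>.
with_dot = [d.get("class") for d in candidates if d.find(".//p") is not None]

# /html/body/div[/html/body/div//p]: the test ignores the candidate and
# asks the whole document the same question every time.
document_has_p = root.find("./body/div//p") is not None
without_dot = [d.get("class") for d in candidates if document_has_p]

print(with_dot)     # ['content']
print(without_dot)  # ['no-content', 'content']
```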

It's a shame that other XPath cheat sheets don't mention nesting. I find it amazing. It lets you scan the document for different patterns and then come back to return something else. The only downside is that queries written this way can become hard to follow. The good news is that there are other ways to do it.

Axes

You can use axes to locate nodes relative to other context nodes.

Let's explore some of them.

The four main axes

//p/ancestor::div => selects all the divs that are ancestors of <p>

How I read it: Get all the <p> tags, for each <p> look through its ancestors. If you find <div> tags, select them.

//p/parent::div => selects all the <div> tags that are parents of <p>

How I read it: Get all the <p> tags and of all their parents, if the parent is a <div>, select it.

//div/child::p=> selects all the <p> tags that are children of <div> tags.

How I read it: Get all the <div> tags and their children, if the child is a <p>, select it.

//div/descendant::p => selects all the <p> tags that are descendants of <div> tags.

How I read it: Get all the <div> tags and their descendants, if the descendant is a <p>, select it.

Now it's time to rewrite the earlier expression:

/html/body/div[./div[./p]] is equivalent to /html/body/div/div/p/parent::div/parent::div

But /html/body/div[.//p] is NOT equivalent to /html/body/div//p/ancestor::div, which also returns the nested <div> ancestors.

The good news is that we can tweak it a little.

/html/body/div//p/ancestor::div[last()] is equivalent to /html/body/div[.//p]. ancestor is a reverse axis, so [last()] picks the outermost <div> ancestor of each <p>.
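To see why the [last()] tweak matters, the ancestor axis can be emulated in Python. ElementTree has no axes, so this sketch builds a child-to-parent lookup and walks upwards by hand. The markup, the "inner" class name and the ancestor_divs helper are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one branch without <p>, one branch with nesting.
root = ET.fromstring(
    '<html><body>'
    '<div class="no-content"><span>nothing here</span></div>'
    '<div class="content">'
    '<div class="inner"><p>one</p></div>'
    '<p>two</p>'
    '</div>'
    '</body></html>'
)

# ElementTree has no ancestor axis, so build a child -> parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestor_divs(node):
    """Emulate ancestor::div, nearest ancestor first (reverse axis order)."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        if node.tag == "div":
            found.append(node.get("class"))
    return found


paragraphs = root.findall("./body/div//p")  # /html/body/div//p

# ancestor::div keeps every <div> ancestor, nested ones included.
all_ancestors = sorted({c for p in paragraphs for c in ancestor_divs(p)})
# ancestor::div[last()] keeps only the outermost <div> ancestor of each <p>.
outermost = sorted({ancestor_divs(p)[-1] for p in paragraphs})
# /html/body/div[.//p], checked by hand for each candidate <div>.
with_p = sorted(
    d.get("class")
    for d in root.findall("./body/div")
    if d.find(".//p") is not None
)

print(all_ancestors)  # ['content', 'inner']
print(outermost)      # ['content']
print(with_p)         # ['content']
```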

Other important axes

//p/following-sibling::span => for each <p> tag, select its following <span> siblings.

//p/preceding-sibling::span => for each <p> tag, select its preceding <span> siblings.

//title/following::span => selects all the <span> tags that appear in the DOM after the <title>.

In our example, //title/following::span selects all the <span> tags in the document.

//p/preceding::div => selects all the <div> tags that appear in the DOM before any <p> tag. But it ignores ancestors, attribute nodes and namespace nodes.

In our case, //p/preceding::div only selects <div class="p-children"> and <div class="no-content">.

Most of the <p> tags are in <div class="content">, but this <div> is not selected because it is a common ancestor for them. As I mentioned, the preceding axis ignores ancestors.

<div class="p-children"> is selected because it is not an ancestor for the <p> tags inside <div class="p-children" id="important">
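The sibling axes can be emulated in a few lines of Python as well. ElementTree does not implement following-sibling or preceding-sibling, so this sketch looks the node up in its parent's child list and slices around it. The markup and the siblings helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a <p> surrounded by mixed siblings.
root = ET.fromstring(
    "<div>"
    "<span>before</span>"
    "<p>anchor</p>"
    "<span>after 1</span>"
    "<em>not a span</em>"
    "<span>after 2</span>"
    "</div>"
)

parent_of = {child: parent for parent in root.iter() for child in parent}


def siblings(node, tag, following=True):
    """Emulate following-sibling::tag or preceding-sibling::tag."""
    children = list(parent_of[node])
    index = children.index(node)
    chosen = children[index + 1:] if following else children[:index]
    return [child.text for child in chosen if child.tag == tag]


p = root.find("./p")
after = siblings(p, "span")                    # //p/following-sibling::span
before = siblings(p, "span", following=False)  # //p/preceding-sibling::span

print(after)   # ['after 1', 'after 2']
print(before)  # ['before']
```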

Summary

Congratulations, you made it. You have added a new tool to your selector toolbox. If you are building a web scraper or automating web tests, this XPath cheat sheet will come in handy. If you are looking for a smoother way to traverse the DOM, you are in the right place. Either way, XPath is worth a try. Who knows, maybe you will discover even more use cases.
Do you find the concept of web scraping interesting? You can get in touch with us here: WebScrapingAPI - Contact. If you want to scrape the web, we will be happy to help you along the way. In the meantime, consider trying WebScrapingAPI - Product for free.