artvee_scraper package

Submodules

artvee_scraper.artvee_client module

class artvee_scraper.artvee_client.ArtveeClient(conn_timeout=3.05, read_timeout=10, max_attempts=3)[source]

Bases: object

HTTP client for interacting with the Artvee API.

Constants:
_HTTP_CONN_TIMEOUT_SEC (float):

Default number of seconds to wait to establish a connection to a remote machine.

_HTTP_READ_TIMEOUT_SEC (float):

Default number of seconds the client will wait for the server to send a response.

_ITEMS_PER_PAGE (int):

Maximum number of items to retrieve per page in API requests.

_TITLE_PATTERN (re.Pattern):

Regex pattern for extracting title and date. ex: Landscape with Weather Vane (1935); group 1 = title (ex: Landscape with Weather Vane), group 2 = date (ex: 1935)

_ARTIST_PATTERN (re.Pattern):

Regex pattern for extracting artist name and origin. ex: Arthur Dove(American, 1880-1946); group 1 = artist name (ex: Arthur Dove), group 2 = origin (ex: American, 1880-1946)

_RESOURCE_PATTERN (re.Pattern):

Regex pattern for extracting the resource name. ex: https://artvee.com/dl/zwei-tanzende/; group 1 = resource (ex: zwei-tanzende)

_IMG_DIMENSION_PATTERN (re.Pattern):

Regex pattern for extracting image dimensions. ex: 1800 x 1185px; group 1 = width (ex: 1800), group 2 = height (ex: 1185)

_IMG_FILE_SIZE_PATTERN (re.Pattern):

Regex pattern for extracting image file size and unit. ex: 1.82 MB; group 1 = size (ex: 1.82), group 2 = unit (ex: MB)
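
The actual expressions behind these constants are not shown in this reference. As a rough sketch, the hypothetical patterns below were written only to match the documented examples and capture groups; the library's real patterns may differ.

```python
import re

# Hypothetical patterns (assumptions, not the library's own expressions).
# Each one captures the groups described for the matching constant above.
TITLE_PATTERN = re.compile(r"^(.+?)\s*\((.+)\)\s*$")
ARTIST_PATTERN = re.compile(r"^(.+?)\s*\((.+)\)\s*$")
RESOURCE_PATTERN = re.compile(r"/dl/([^/]+)/?$")
IMG_DIMENSION_PATTERN = re.compile(r"(\d+)\s*x\s*(\d+)\s*px")
IMG_FILE_SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def groups(pattern, text):
    """Return the captured groups for text, or None when nothing matches."""
    match = pattern.search(text)
    return match.groups() if match else None


title = groups(TITLE_PATTERN, "Landscape with Weather Vane (1935)")
artist = groups(ARTIST_PATTERN, "Arthur Dove(American, 1880-1946)")
resource = groups(RESOURCE_PATTERN, "https://artvee.com/dl/zwei-tanzende/")
dimensions = groups(IMG_DIMENSION_PATTERN, "1800 x 1185px")
file_size = groups(IMG_FILE_SIZE_PATTERN, "1.82 MB")
```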

Attributes:
_timeout (tuple[float, float]):

Timeouts to use for HTTP requests.

_session (Session):

Allows persistence of parameters across HTTP requests.

Args:
conn_timeout (float, optional):

Number of seconds to wait to establish a connection to a remote machine. Defaults to 3.05 seconds.

read_timeout (float, optional):

Number of seconds the client will wait for the server to send a response. Defaults to 10 seconds.

max_attempts (int, optional):

The maximum number of attempts (including the initial call). Must be between 1 and 10. Defaults to 3 (initial call + two retries).

Raises:
ValueError:

If conn_timeout is not positive. If read_timeout is not positive. If max_attempts is not in the range [1, 10].
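
The argument checks described above can be sketched as a standalone function. This is a minimal illustration, not the constructor's code: validate_client_args is a hypothetical name, and returning a (connect, read) tuple is an assumption based on the documented _timeout attribute.

```python
def validate_client_args(conn_timeout=3.05, read_timeout=10, max_attempts=3):
    """Hypothetical helper mirroring the documented constructor checks."""
    if conn_timeout <= 0:
        raise ValueError("conn_timeout must be positive")
    if read_timeout <= 0:
        raise ValueError("read_timeout must be positive")
    if not 1 <= max_attempts <= 10:
        raise ValueError("max_attempts must be in the range [1, 10]")
    # Assumed shape of the stored timeouts: (connect, read), in seconds.
    return (conn_timeout, read_timeout)


def raises_value_error(**kwargs):
    """Return True when the given arguments are rejected."""
    try:
        validate_client_args(**kwargs)
    except ValueError:
        return True
    return False
```

With the defaults, max_attempts=3 means one initial call plus at most two retries.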

get_image(img_metadata)[source]

Retrieve the image data.

Return type:

bytes

Args:
img_metadata (ImageMetadata):

Information that describes attributes of an artwork.

Returns:
bytes:

The raw JPG image data.

Raises:
requests.exceptions.HTTPError:

If the HTTP request returns an unsuccessful status code.

get_metadata(category, page)[source]

Retrieve artwork metadata for a specified category and page.

Return type:

List[Tuple[ArtworkMetadata, ImageMetadata]]

Args:
category (CategoryType):

The category for which to retrieve artwork metadata.

page (int):

The page number to retrieve the metadata from. Pages are indexed starting at 1.

Returns:
List[Tuple[ArtworkMetadata, ImageMetadata]]:

A list of tuples, one per artwork. Each tuple pairs an ArtworkMetadata (attributes of the artwork) with an ImageMetadata (attributes of its image file).

Raises:
requests.exceptions.HTTPError:

If the HTTP request returns an unsuccessful status code.
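
Because pages are indexed from 1, walking a whole category means iterating from 1 through the value returned by get_page_count. The sketch below shows that loop against a stand-in object; FakeClient and iter_category_metadata are hypothetical names used only so the example runs without network access.

```python
from typing import Any, Iterator, List, Tuple


def iter_category_metadata(client: Any, category: Any) -> Iterator[Tuple[Any, Any]]:
    """Yield every (artwork metadata, image metadata) pair in a category."""
    total_pages = client.get_page_count(category)
    # Pages are 1-indexed, so the last page number equals the page count.
    for page in range(1, total_pages + 1):
        for pair in client.get_metadata(category, page):
            yield pair


class FakeClient:
    """Stand-in with the same two method names as the documented client."""

    def __init__(self, pages: List[List[Tuple[str, str]]]) -> None:
        self._pages = pages

    def get_page_count(self, category: Any) -> int:
        return len(self._pages)

    def get_metadata(self, category: Any, page: int) -> List[Tuple[str, str]]:
        return self._pages[page - 1]


fake = FakeClient([[("art-1", "img-1")], [("art-2", "img-2"), ("art-3", "img-3")]])
pairs = list(iter_category_metadata(fake, "abstract"))
```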

get_page_count(category)[source]

Retrieve the total number of webpages for a given category.

Return type:

int

Args:
category (CategoryType):

The category for which to retrieve the page count.

Returns:

int: The total number of pages available for the specified category.

Raises:
requests.exceptions.HTTPError:

If the HTTP request returns an unsuccessful status code.

ValueError:

If the total item count cannot be parsed as an integer.
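
This reference does not show how the page count is derived. One plausible reading, given the _ITEMS_PER_PAGE constant and the ValueError above, is a ceiling division of the total item count by the page size. The sketch below assumes that, and also assumes the total arrives as text that may contain thousands separators; page_count and the page size of 50 are illustrative only.

```python
import math


def page_count(total_items_text: str, items_per_page: int) -> int:
    """Hypothetical helper: pages needed to hold the given number of items."""
    try:
        total_items = int(total_items_text.replace(",", "").strip())
    except ValueError as exc:
        raise ValueError(f"cannot parse total items: {total_items_text!r}") from exc
    # Round up so that a partially filled last page is still counted.
    return math.ceil(total_items / items_per_page)


pages_for_1234_items = page_count("1,234", 50)
```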

artvee_scraper.artwork module

class artvee_scraper.artwork.Artwork(url, resource, title, category, artist='Unknown Artist', date='n.d.', origin=None, image=None)[source]

Bases: ArtworkMetadata

Represents an artistic work.

Attributes:
image (Image, optional):

The image, including associated metadata. Defaults to None

image: Image | None = None
to_dict()[source]

A dict representation of this object.

The image will be encoded in Base64 binary.

Return type:

dict | None

Returns:

This object as a dict.

class artvee_scraper.artwork.ArtworkMetadata(url, resource, title, category, artist='Unknown Artist', date='n.d.', origin=None)[source]

Bases: object

Information that describes attributes of an artwork.

Attributes:
url (str):

Artwork URL (ex: https://artvee.com/dl/zwei-tanzende/)

resource (str):

Unique name of artwork; extracted from the URL (ex: zwei-tanzende)

title (str):

Name of the artwork (ex: Zwei Tanzende)

category (str):

Category the artwork belongs to (ex: abstract)

artist (str, optional):

Name of the person that created this artwork. Defaults to Unknown Artist

date (str, optional):

Year or date range the artwork was completed (ex: 2012 - 2019). Defaults to n.d. (no date)

origin (str, optional):

Artist nationality and lifespan (ex: Austrian, 1834-1921). Defaults to None

artist: str = 'Unknown Artist'
category: str
date: str = 'n.d.'
origin: str | None = None
resource: str
title: str
url: str
class artvee_scraper.artwork.CategoryType(value)[source]

Bases: Enum

Enumeration for different categories of art.

Attributes:
ABSTRACT (str):

Art that uses shapes, colors, forms, and gestural marks rather than aiming for an accurate representation of visual reality.

FIGURATIVE (str):

Art that represents recognizable subjects, particularly the human figure, focusing on real-world forms.

LANDSCAPE (str):

Art that depicts natural scenes, often focusing on the beauty and atmosphere of the environment.

RELIGION (str):

Art that conveys spiritual themes or depicts subjects related to faith, spirituality, and religious practices.

MYTHOLOGY (str):

Art that illustrates or interprets themes, characters, and stories from myths and legends, often exploring cultural beliefs and narratives.

POSTERS (str):

Art designed for display and promotion, often featuring bold imagery and text to convey a message or advertise events, products, or causes.

ANIMALS (str):

Art that focuses on the representation of animals, capturing their form, behavior, and characteristics.

ILLUSTRATION (str):

Art that creates images to accompany text or convey a narrative, often found in books, magazines, and advertising.

STILL_LIFE (str):

Art that depicts inanimate objects, such as fruits, flowers, and everyday items.

BOTANICAL (str):

Art that focuses on the representation of plants and flowers, often emphasizing accuracy and detail to depict their beauty and scientific characteristics.

DRAWINGS (str):

Art form created using various mediums, such as pencil, charcoal, or ink, to render images through lines and shading, often capturing ideas, sketches, or detailed representations.

ASIAN_ART (str):

Art that represents styles and techniques from Asian cultures, often reflecting traditional themes, motifs, and cultural significance.

ABSTRACT = 'abstract'
ANIMALS = 'animals'
ASIAN_ART = 'asian-art'
BOTANICAL = 'botanical'
DRAWINGS = 'drawings'
FIGURATIVE = 'figurative'
ILLUSTRATION = 'illustration'
LANDSCAPE = 'landscape'
MYTHOLOGY = 'mythology'
POSTERS = 'posters'
RELIGION = 'religion'
STILL_LIFE = 'still-life'
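
Since each member's value is the category's URL-style name, a member can be looked up from that string with the standard Enum call syntax. The enum below is a local copy of the documented members so the example is self-contained; in real code the class would be imported from artvee_scraper.artwork.

```python
from enum import Enum


class CategoryType(Enum):
    """Local copy of the documented members, for illustration only."""

    ABSTRACT = "abstract"
    FIGURATIVE = "figurative"
    LANDSCAPE = "landscape"
    RELIGION = "religion"
    MYTHOLOGY = "mythology"
    POSTERS = "posters"
    ANIMALS = "animals"
    ILLUSTRATION = "illustration"
    STILL_LIFE = "still-life"
    BOTANICAL = "botanical"
    DRAWINGS = "drawings"
    ASIAN_ART = "asian-art"


# Look up a member by value, then read the value back from a member.
still_life = CategoryType("still-life")
asian_art_value = CategoryType.ASIAN_ART.value

# An unknown value raises ValueError.
try:
    CategoryType("sculpture")
    unknown_accepted = True
except ValueError:
    unknown_accepted = False
```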
class artvee_scraper.artwork.Image(source_url=None, width=0, height=0, file_size=0, file_size_unit=None, raw=None, format_name='jpg')[source]

Bases: ImageMetadata

Represents a graphical image.

Attributes:
raw (bytes, optional):

The raw binary data of the image. Defaults to None

format_name (str, optional):

The format of the image file - identifies how the image should be processed/displayed. Defaults to jpg.

b64encoded()[source]

Returns the image encoded as Base64 binary.

Return type:

str | None

Returns:

Base64 binary encoded image.

format_name: str = 'jpg'
raw: bytes | None = None
to_dict()[source]

A dict representation of this object.

The image will be encoded in Base64 binary.

Return type:

dict | None

Returns:

This object as a dict.
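
The Base64 step mentioned for b64encoded() and to_dict() can be sketched with the standard library. This is an assumption about the general approach rather than the class's code; b64encode_raw is a hypothetical helper, and decoding the result to ASCII text is assumed from the documented str | None return type.

```python
import base64
from typing import Optional


def b64encode_raw(raw: Optional[bytes]) -> Optional[str]:
    """Hypothetical helper: Base64-encode raw image bytes as text."""
    if raw is None:
        # No image data has been loaded, so there is nothing to encode.
        return None
    return base64.b64encode(raw).decode("ascii")


encoded = b64encode_raw(b"abc")
decoded = base64.b64decode(encoded)
```

Encoding as text keeps the resulting dict serializable, for example with json.dumps, which cannot handle bytes values directly.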

class artvee_scraper.artwork.ImageMetadata(source_url=None, width=0, height=0, file_size=0, file_size_unit=None)[source]

Bases: object

Information that describes attributes of an image file.

Attributes:
source_url (str, optional):

The URL where the image is sourced from. Defaults to None

width (int, optional):

The width of the image in pixels. Defaults to 0

height (int, optional):

The height of the image in pixels. Defaults to 0

file_size (float, optional):

The size of the image file. Defaults to 0

file_size_unit (str, optional):

The unit of the file size (ex: “KB”, “MB”). Defaults to None

file_size: float = 0
file_size_unit: str | None = None
height: int = 0
source_url: str | None = None
width: int = 0

artvee_scraper.scraper module

class artvee_scraper.scraper.ArtveeScraper(artvee_client=<artvee_scraper.artvee_client.ArtveeClient object>, worker_threads=3, categories=(CategoryType.ABSTRACT, CategoryType.FIGURATIVE, CategoryType.LANDSCAPE, CategoryType.RELIGION, CategoryType.MYTHOLOGY, CategoryType.POSTERS, CategoryType.ANIMALS, CategoryType.ILLUSTRATION, CategoryType.STILL_LIFE, CategoryType.BOTANICAL, CategoryType.DRAWINGS, CategoryType.ASIAN_ART))[source]

Bases: object

A web scraper which concurrently extracts artwork from Artvee. Callbacks are notified asynchronously for each scraped artwork so that user-defined actions may be taken.

Attributes:
_artvee_client (ArtveeClient):

An HTTP client for accessing artwork.

_worker_pool (ThreadPoolExecutor):

A pool of threads to manage concurrent scraping tasks.

_categories (Tuple[CategoryType]):

Category types to scrape.

_boss_thread (Thread):

The main thread responsible for executing the scraping logic; delegates tasks to workers.

_stop_event (Event):

Signal to indicate the scraping process should be halted.

_listener_lock (Lock):

A lock which provides access to the listeners in a thread-safe manner.

_listeners (dict):

The registered callbacks to invoke asynchronously; each callback is stored at most once.

Args:
artvee_client (ArtveeClient, optional):

An HTTP client for accessing artwork. Defaults to a new instance.

worker_threads (int, optional):

The number of worker threads to use for processing. Must be between 1 and 10. Defaults to 3.

categories (Tuple[CategoryType], optional):

Category types to scrape. Defaults to all categories.

Raises:
ValueError:

If worker_threads is not in the range [1, 10].

deregister_listener(on_event_listener)[source]

Deregisters a callback so that it will no longer be notified of events.

Return type:

Self

Args:
on_event_listener (Callable[[Artwork, Exception | None], None]):

The callback to no longer notify when an event occurs.

Returns:

self for method chaining

join()[source]

Blocks the calling thread until all active tasks have been completed.

Return type:

None

Returns:

None

register_listener(on_event_listener)[source]

Registers a callback to be notified of events asynchronously.

A callback may only be registered once; additional attempts to register the same callback will overwrite the previous registration. The success notification contains a fully populated Artwork, whereas the failure notification contains a partially populated Artwork and associated exception.

Return type:

Self

Args:
on_event_listener (Callable[[Artwork, Exception | None], None]):

A callback function that will be notified asynchronously when an event occurs.

Raises:
ValueError:

If the provided on_event_listener argument is not callable.

Returns:

self for method chaining
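
The registration rules above (callable check, one entry per callback, method chaining) can be illustrated with a small lock-protected registry. ListenerRegistry is a hypothetical stand-in, not the scraper's implementation, and for simplicity it notifies listeners synchronously, whereas the scraper is documented to notify asynchronously.

```python
import threading


class ListenerRegistry:
    """Hypothetical sketch of a thread-safe callback registry."""

    def __init__(self):
        self._lock = threading.Lock()
        self._listeners = {}

    def register(self, listener):
        if not callable(listener):
            raise ValueError("listener must be callable")
        with self._lock:
            # Keyed by the callback itself, so re-registering overwrites.
            self._listeners[listener] = listener
        return self

    def deregister(self, listener):
        with self._lock:
            self._listeners.pop(listener, None)
        return self

    def notify(self, artwork, error=None):
        # Copy under the lock, then call outside it so that a slow
        # callback cannot block registration from other threads.
        with self._lock:
            listeners = list(self._listeners.values())
        for listener in listeners:
            listener(artwork, error)


events = []


def on_event(artwork, error):
    """Example callback: error is None on success, an exception on failure."""
    events.append((artwork, error))


registry = ListenerRegistry().register(on_event).register(on_event)
registry.notify("example-artwork")
```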

shutdown()[source]

Initiates a shutdown in which running tasks are completed, but all pending tasks are canceled.

Return type:

None

Returns:

None
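
The shutdown behavior described above, where running tasks finish and pending tasks are canceled, matches what the standard ThreadPoolExecutor offers through shutdown(cancel_futures=True) in Python 3.9 and later. Whether the scraper uses exactly this call is an assumption; the sketch shows only the standard-library behavior, and demo_shutdown is a hypothetical name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def demo_shutdown():
    """Show that a running task completes while queued tasks are canceled."""
    started = threading.Event()
    release = threading.Event()

    def running_task():
        started.set()             # signal that the worker picked this task up
        release.wait(timeout=5)   # stay busy until the main thread releases us
        return "done"

    pool = ThreadPoolExecutor(max_workers=1)
    running = pool.submit(running_task)
    # With a single worker, these stay queued behind the running task.
    pending = [pool.submit(lambda: "never runs") for _ in range(3)]

    started.wait(timeout=5)
    # Requires Python 3.9+: cancel everything still waiting in the queue.
    pool.shutdown(wait=False, cancel_futures=True)
    release.set()
    return running.result(timeout=5), [future.cancelled() for future in pending]


result, cancelled_flags = demo_shutdown()
```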

start()[source]

Begin scraping artwork.

Return type:

None

Returns:

None

Module contents