webutil 

Reference documentation.

flask_util 

Utilities for Flask. View classes, decorators, URL route converters, etc.

class RegexConverter(url_map, *items)[source]

Bases: BaseConverter

Regexp URL route for Werkzeug/Flask.

Based on https://github.com/rhyselsmore/flask-reggie.

Usage:

@app.route('/<regex("abc|def"):letters>')

Install with:

app.url_map.converters['regex'] = RegexConverter

get_required_param(name)[source]

Returns the given request parameter.

If it’s not in a query parameter or POST field, the current HTTP request aborts with status 400.

ndb_context_middleware(app, client=None, **kwargs)[source]

WSGI middleware to add an NDB context per request.

Follows the WSGI standard. Details: http://www.python.org/dev/peps/pep-0333/

Install with eg:

ndb_client = ndb.Client()
app = Flask('my-app')
app.wsgi_app = flask_util.ndb_context_middleware(app.wsgi_app, ndb_client)

Background: https://cloud.google.com/appengine/docs/standard/python3/migrating-to-cloud-ndb#using_a_runtime_context_with_wsgi_frameworks

Parameters:

client – google.cloud.ndb.Client
kwargs – passed through to google.cloud.ndb.Client.context()

handle_exception(e)[source]

Flask error handler that propagates HTTP exceptions into the response.

Install with:

app.register_error_handler(Exception, handle_exception)

error(msg, status=400, exc_info=False, **kwargs)[source]

Logs and returns an HTTP error via werkzeug.exceptions.HTTPException.

Parameters:

msg (str)
status (int)
exc_info – Python exception info three-tuple, eg from sys.exc_info()
kwargs – passed through to flask.abort()

flash(msg, **kwargs)[source]: Wrapper for flask.flash() that also logs the message.

default_modern_headers(resp)[source]

Include modern HTTP headers by default, but let the response override them.

Install with:

app.after_request(default_modern_headers)

cached(cache, timeout, headers=(), http_5xx=False)[source]

Thin flask-cache wrapper that supports timedelta and cache query param.

If the cache URL query parameter is false, skips the cache. Also, does not store the response in the cache if it’s an HTTP 5xx or if there are any flashed messages.

Parameters:

cache (flask_caching.Cache)
timeout (datetime.timedelta)
headers – sequence of str, optional headers to include in the cache key
http_5xx (bool) – optional, whether to cache HTTP 5xx (server error) responses

headers(headers, error_codes=(404,))[source]

Flask decorator that adds headers to the response.

Parameters:

headers (dict mapping str header name to str value)
error_codes (sequence of int) – 4xx and 5xx HTTP codes to include the headers with, along with 2xx and 3xx.

cloud_tasks_only(log=True)[source]

Flask decorator that returns HTTP 401 if the request isn’t from Cloud Tasks.

(…or from App Engine Cron.)

https://cloud.google.com/tasks/docs/creating-appengine-handlers#reading-headers
https://cloud.google.com/appengine/docs/standard/scheduling-jobs-with-cron-yaml#securing_urls_for_cron

Must be used below flask.Flask.route(), eg:

@app.route('/path')
@cloud_tasks_only()
def handler():
    ...

Parameters:: log (boolean) – whether to log the task name. If None, task name is logged only if the traceparent HTTP header is not set.

canonicalize_domain(from_domains, to_domain)[source]

Returns a callable, suitable for flask.Flask.before_request(), that redirects one or more domains to a canonical domain.

Preserves scheme, path, and query.

Install with eg:

app = flask.Flask(...)
app.before_request(canonicalize_domain(('old1.com', 'old2.org'), 'new.com'))

Parameters:

from_domains – str or sequence of str
to_domain – str

canonicalize_request_domain(from_domains, to_domain)[source]

Flask handler decorator that redirects to a canonical domain.

Use below flask.Flask.route(), eg:

@app.route('/path')
@canonicalize_request_domain('foo.com', 'bar.com')
def handler():
    ...

Parameters:

from_domains – str or sequence of str
to_domain – str

class XrdOrJrd[source]

Bases: View

Renders and serves an XRD or JRD file.

JRD is served if the request path ends in .jrd or .json, or the format query parameter is jrd or json, or the request’s Accept header includes jrd or json.

XRD is served if the request path ends in .xrd or .xml, or the format query parameter is xml or xrd, or the request’s Accept header includes xml or xrd.

Otherwise, defaults to DEFAULT_TYPE.

Subclasses must override template_prefix() and template_vars(). URL route variables are passed through to template_vars() as keyword args.

DEFAULT_TYPE = 'jrd': Either JRD or XRD, the type to return by default if the request doesn’t ask for one explicitly with the Accept header.

template_prefix()[source]: Returns template filename, without extension.

template_vars(**kwargs)[source]

Returns a dict with template variables.

URL route variables are passed through as kwargs.

class FlashErrors[source]

Bases: View

Wraps a Flask flask.views.View and flashes errors.

Mostly used with OAuth endpoints.

instance_info 

Renders vital stats about a single App Engine instance.

Intended for developers, not users. To turn on concurrent request recording, add the middleware and the info handler to your WSGI application, eg:

from oauth_dropins.webutil.instance_info import concurrent_requests_wsgi_middleware, info
application = concurrent_requests_wsgi_middleware(WSGIApplication([
    ...
    ('/_info', info),
]))

class Concurrent(count, when): Bases: tuple

info()[source]: Flask handler that renders current instance info.

concurrent_requests_wsgi_middleware(app)[source]

WSGI middleware for per request instance info instrumentation.

Follows the WSGI standard. Details: http://www.python.org/dev/peps/pep-0333/
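
As a rough illustration of the WSGI middleware pattern described here, the following is a minimal standalone sketch, not webutil's actual implementation. The names counting_middleware_sketch, hello_app, and stats are hypothetical, and what the real middleware records is an assumption.

```python
import threading

def counting_middleware_sketch(app):
    """Wraps a WSGI app and counts requests currently inside it (hypothetical)."""
    lock = threading.Lock()
    stats = {'current': 0, 'peak': 0}

    def wrapper(environ, start_response):
        with lock:
            stats['current'] += 1
            stats['peak'] = max(stats['peak'], stats['current'])
        try:
            # The count drops when the app returns its iterable, which may
            # be before the body has been fully sent to the client.
            return app(environ, start_response)
        finally:
            with lock:
                stats['current'] -= 1

    wrapper.stats = stats
    return wrapper

def hello_app(environ, start_response):
    """Trivial WSGI app used only to exercise the wrapper."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']

wrapped = counting_middleware_sketch(hello_app)
body = wrapped({}, lambda status, headers: None)
```

The wrapper has the same (environ, start_response) signature as the app it wraps, which is what lets WSGI middleware be stacked.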

logs 

A handler that serves all app logs for an App Engine HTTP request.

StackDriver Logging API: https://cloud.google.com/logging/docs/apis

sanitize(msg)[source]: Sanitizes access tokens and Authorization headers.

url(when, key, **params)[source]

Returns the relative URL (no scheme or host) to a log page.

Parameters:

when (datetime)
key (ndb.Key or str)
params – included as query params, eg module, path

maybe_link(when, key, time_class='dt-updated', link_class='', **params)[source]

Returns an HTML snippet with a timestamp and maybe a log page link.

Example:

<a href="/log?start_time=1513904267&key=aglz..." class="u-bridgy-log">
  <time class="dt-updated" datetime="2017-12-22T00:57:47.222060"
          title="Fri Dec 22 00:57:47 2017">
    3 days ago
  </time>
</a>

The <a> tag is only included if the timestamp is 30 days old or less, since Stackdriver’s basic tier doesn’t store logs older than that:

https://cloud.google.com/monitoring/accounts/tiers#logs_ingestion
https://github.com/snarfed/bridgy/issues/767

Parameters:

when (datetime)
key (ndb.Key or str)
time_class (str) – optional class value for the <time> tag
link_class (str) – optional class value for the <a> tag (if generated)
params (dict of str to str) – query params to include in the link URL, eg module, path

Returns: string HTML

linkify_datastore_keys(msg)[source]: Converts string datastore keys to links to the admin console viewer.

log(module=None, path=None)[source]

Flask view that searches for and renders app logs for an HTTP request.

URL parameters:

start_time (float): seconds since the epoch
key (str): token to find in the first app log of the request

Install with:

app.add_url_rule('/log', view_func=logs.log)

Or:

@app.get('/log')
@cache.cached(600)
def log():
  return logs.log()

Parameters:

module (str) – App Engine module to search. Defaults to all.
path (str or sequence of str) – optional HTTP request path(s) to limit logs to.

Returns:

Flask response

Return type:

(str response body, dict headers) tuple

models 

App Engine datastore model base classes, properties, and utilities.

class StringIdModel(**kwargs)[source]

Bases: Model

An ndb.Model class that requires a string id.

put(*args, **kwargs)[source]: Raises AssertionError if string id is not provided.

class JsonProperty(*args, **kwargs)[source]

Bases: TextProperty

Fork of ndb’s that subclasses ndb.TextProperty instead of ndb.BlobProperty.

This makes values show up as normal, human-readable, serialized JSON in the web console. https://github.com/googleapis/python-ndb/issues/874#issuecomment-1442753255

Duplicated in arroba: https://github.com/snarfed/arroba/blob/main/arroba/ndb_storage.py

class ComputedJsonProperty(*args, **kwargs)[source]

Bases: JsonProperty, ComputedProperty

Custom ndb.ComputedProperty for JSON values that stores them as strings.

…instead of like ndb.StructuredProperty, with “entity” type, which bloats them unnecessarily in the datastore.

class EnumProperty(enum_class, **kwargs)[source]

Bases: IntegerProperty

Property for storing Python Enum values.

Stores the enum’s integer value in the datastore.

class EncryptedProperty(name=None, compressed=None, indexed=None, repeated=None, required=None, default=None, choices=None, validator=None, verbose_name=None, write_empty_list=None)[source]

Bases: BlobProperty

Property that stores encrypted bytes.

Encrypts bytes values using AES-256-GCM before storing in the datastore, and decrypts them when reading back.

The AES-256-GCM key should be in the encrypted_property_key file, base64 encoded. Here’s example code to generate an AES-256-GCM key and base64 encode it:

import base64
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key_bytes = AESGCM.generate_key(bit_length=256)
print(base64.b64encode(key_bytes))

testutil 

Unit test utilities.

requests_response(body='', url=None, status=200, content_type=None, redirected_url=None, headers=None, allow_redirects=None, encoding=None)[source]

Parameters:: redirected_url (str or sequence of str) – URL(s) for multiple redirects

enable_flask_caching(app, cache)[source]

Test case decorator that enables a flask_caching cache.

Usage:

from app import app, cache

class FooTest(TestCase):
  @enable_flask_caching(app, cache)
  def test_foo(self):
    ...

Parameters:

app (Flask)
cache (flask_caching.Cache)

class UrlopenResult(status_code, content, url=None, headers={})[source]

Bases: object

A fake urllib.request.urlopen() or urlfetch.fetch() result object.

class Asserts[source]

Bases: object

Test case mixin class with extra assert helpers.

assert_entities_equal(a, b, ignore=frozenset({}), keys_only=False, in_order=False)[source]

Asserts that a and b are equivalent entities or lists of entities.

…specifically, that they have the same property values, and if they both have populated keys, that their keys are equal too.

Parameters:

a (ndb.Model) – instances or lists of instances
b (ndb.Model) – same
ignore (sequence of str) – property names not to compare
keys_only (bool) – if True only compare keys
in_order (bool) – if False, all entities must have keys

entity_keys(entities)[source]: Returns a list of keys for a list of entities.

assert_equals(expected, actual, msg=None, in_order=False, ignore=())[source]

Pinpoints individual element differences in lists and dicts.

If in_order is False, ignores order in lists and tuples.

assert_multiline_equals(expected, actual, ignore_blanks=False)[source]

Compares two multi-line strings and reports a diff style output.

Ignores leading and trailing whitespace on each line, and squeezes repeated blank lines down to just one.

Parameters:: ignore_blanks (boolean) – whether to ignore blank lines altogether

assert_multiline_in(expected, actual, ignore_blanks=False)[source]

Checks that a multi-line string is in another and reports a diff output.

Ignores leading and trailing whitespace on each line, and squeezes repeated blank lines down to just one.

Parameters:: ignore_blanks (boolean) – whether to ignore blank lines altogether

class TestCase(methodName='runTest')[source]

Bases: MoxTestBase, Asserts

Test case class with lots of extra helpers.

stub_requests_head()[source]: Automatically return 200 to outgoing HEAD requests.

unstub_requests_head()[source]: Mock outgoing HEAD requests so they must be expected individually.

expect_urlopen(url, response=None, status=200, data=None, headers=None, response_headers={}, **kwargs)[source]

Stubs out urllib.request.urlopen() and sets up an expected call.

If status isn’t 2xx, makes the expected call raise a urllib.error.HTTPError instead of returning the response.

If data is set, url must be a urllib.request.Request.

If response is unset, returns the expected call.

Parameters:

url (str, re.RegexObject, urllib.request.Request, or webob.request.Request)
response (str)
status (int) – HTTP response code
data (str) – optional POST body
headers (dict) – optional expected request headers
response_headers (dict) – optional response headers
kwargs – other keyword args, e.g. timeout

util 

Misc web-related utilities.

user_agent = 'webutil (https://github.com/snarfed/webutil)': Set with set_user_agent().

HTTP_TIMEOUT = 15: Default HTTP request timeout, used in requests_get() etc.

MAX_HTTP_RESPONSE_SIZE = 2000000

Average HTML size as of 2015-10-15 is 56K, so this is generous and conservative. Raised from 1MB to 2MB on 2023-07-07.

now(tz=datetime.timezone.utc)[source]: Alias, allows unit tests to mock the function.

beautifulsoup_parser = None: Global config, string parser for BeautifulSoup to use, e.g. ‘lxml’. May be set at runtime. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

LINK_RE = re.compile('\\b(?:[a-z]{3,9}:/{1,3})?(?<![@＠])(?:[^\\s.!"#$%&\'()*+,/:;<=>?@[\\]^_`{|}~＠:﹕：]+\\.)+[a-z]{2,}(?::\\d{2,6})?(?:(?:/[\\w/.\\-_~.;:%?@$#&()=+]*)|\\b)', re.IGNORECASE)

Regexps for domains, hostnames, and URLs.

Based on kylewm’s from redwind:

I used to use a more complicated regexp based on https://github.com/silas/huck/blob/master/huck/utils.py#L59 , but i kept finding new input strings that would make it hang the regexp engine.

More complicated alternatives:

List of TLDs: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#ICANN-era_generic_top-level_domains

Allows emoji and other unicode chars in all domain labels except TLDs. TODO: support IDN TLDs:

TODO: fix bug in LINK_RE that makes it miss emoji domain links without scheme, eg ☕⊙.ws. bug is that the \b at the beginning of SCHEME_RE doesn’t apply to emoji, since they’re not word-constituent characters, and that the ? added in LINK_RE only applies to the parenthesized group in SCHEME_RE, not the \b. I tried changing \b to '(?:^|[\s%s])' % PUNCT, but that broke other things.

class Struct(**kwargs)[source]

Bases: object

A generic class that initializes its attributes from constructor kwargs.

class CacheDict[source]

Bases: dict

A dict that also implements memcache’s get_multi and set_multi methods.

Useful as a simple in memory replacement for App Engine’s memcache API for e.g. granary.Source.get_activities_response().
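
A minimal sketch of the idea, assuming get_multi returns only the keys that are present and set_multi takes a mapping. CacheDictSketch is a hypothetical stand-in, not the library class.

```python
class CacheDictSketch(dict):
    """Dict with memcache-style batch helpers (hypothetical stand-in)."""

    def get_multi(self, keys):
        # Return only the keys that are present, like a cache lookup
        return {k: self[k] for k in keys if k in self}

    def set_multi(self, mapping):
        self.update(mapping)

cache = CacheDictSketch()
cache.set_multi({'a': 1, 'b': 2})
found = cache.get_multi(['a', 'c'])  # 'c' is missing, so it's omitted
```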

to_xml(value)[source]: Renders a dict (usually from JSON) as an XML snippet.

trim_nulls(value, ignore=())[source]

Recursively removes dict and list elements with None or empty values.

Parameters:

value (dict or list)
ignore (sequence) – optional, keys that may have None/empty values. Transitive: ignored keys’ entire contents are ignored and allowed to have nulls, all the way down!
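
The recursion and the transitive ignore behavior can be sketched roughly as follows. This is an illustration only: trim_nulls_sketch is a hypothetical name, and treating 0 and False as non-empty is an assumption, not something the docs above confirm.

```python
def trim_nulls_sketch(value, ignore=()):
    """Recursively drops None and empty str/list/tuple/dict/set values."""
    def is_null(v):
        # Assumption: 0 and False are kept, only None and empty containers go
        return v is None or (
            isinstance(v, (str, list, tuple, dict, set)) and len(v) == 0)

    if isinstance(value, dict):
        # Ignored keys are copied untouched, so their contents keep nulls
        trimmed = {k: (v if k in ignore else trim_nulls_sketch(v, ignore))
                   for k, v in value.items()}
        return {k: v for k, v in trimmed.items()
                if k in ignore or not is_null(v)}
    if isinstance(value, list):
        trimmed = [trim_nulls_sketch(v, ignore) for v in value]
        return [v for v in trimmed if not is_null(v)]
    return value

result = trim_nulls_sketch(
    {'a': None, 'b': {'c': [], 'd': 0}, 'e': [None, 'x', {}], 'keep': None},
    ignore=('keep',))
```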

uniquify(input)[source]

Returns a list with duplicate items removed.

Like list(set(...)), but preserves order.
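
One common way to get order-preserving de-duplication, shown as a sketch rather than the library's code. uniquify_sketch is a hypothetical name, and this version assumes the items are hashable.

```python
def uniquify_sketch(items):
    """Order-preserving de-duplication (hypothetical stand-in)."""
    # dict keys keep insertion order (Python 3.7+), so the first
    # occurrence of each item determines its position
    return list(dict.fromkeys(items))

deduped = uniquify_sketch(['b', 'a', 'b', 'c', 'a'])
```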

get_list(obj, key)[source]

Returns a value from a dict as a list.

If the value is a list or tuple, it’s converted to a list. If it’s something else, it’s returned as a single-element list. If the key doesn’t exist, returns [].
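
The three cases can be sketched like this. get_list_sketch is a hypothetical name, and returning [] for an explicit None value is an assumption beyond what the description states.

```python
def get_list_sketch(obj, key):
    """Returns obj[key] as a list (hypothetical stand-in for get_list())."""
    val = obj.get(key)
    if val is None:
        # Missing key; assumption: an explicit None is treated the same way
        return []
    if isinstance(val, (list, tuple)):
        return list(val)
    return [val]
```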

pop_list(obj, key)[source]: Like get_list(), but also removes the item.

add(seq, val)[source]

Appends val to seq if seq doesn’t already contain it.

Useful for treating repeated ndb properties like sets instead of lists.

Returns:: True if val was added to seq, ie it wasn’t already in seq, False otherwise

remove(seq, val)[source]

Removes val from seq if seq contains it.

Useful for treating repeated ndb properties like sets instead of lists.

encode(obj, encoding='utf-8')[source]

Character encodes all unicode strings in a collection, recursively.

Parameters:

obj (list, tuple, dict, set, or primitive)
encoding (str) – character encoding

Returns:

obj with all unicode strings encoded

Return type:

sequence or dict

get_first(obj, key, default=None)[source]

Returns the first element of a dict value.

If the value is a list or tuple, returns the first value. If it’s something else, returns the value itself. If the key doesn’t exist, returns default, which is None unless specified.

get_url(val, key=None)[source]

Returns val['url'] if val is a dict, otherwise val.

If key is not None, looks in val[key] instead of val.

get_urls(obj, key, inner_key=None)[source]

Returns elem['url'] if dict, otherwise elem, for each elem in obj[key].

If inner_key is provided, the returned values are elem[inner_key]['url'].

tag_uri(domain, name, year=None)[source]

Returns a tag URI string for the given domain and name.

Example return value: tag:twitter.com,2012:snarfed_org/172417043893731329

Background on tag URIs: http://taguri.org/

parse_tag_uri(uri)[source]

Returns the domain and name in a tag URI string.

Inverse of tag_uri().

Returns:: (str domain, str name) tuple, or None if the tag URI couldn’t be parsed
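
A rough sketch of the round trip between tag_uri() and parse_tag_uri(), based only on the example value above. The function names and the regular expression are hypothetical, and omitting the year when it is None is an assumption.

```python
import re

def tag_uri_sketch(domain, name, year=None):
    """Builds a tag URI; assumes a missing year is simply left out."""
    year_part = f',{year}' if year is not None else ''
    return f'tag:{domain}{year_part}:{name}'

# domain, optional ",year", then the name after the next colon
TAG_URI_RE = re.compile(r'^tag:([^,:]+)(?:,\d+)?:(.+)$')

def parse_tag_uri_sketch(uri):
    """Returns (domain, name), or None if uri doesn't look like a tag URI."""
    match = TAG_URI_RE.match(uri)
    return match.groups() if match else None

uri = tag_uri_sketch('twitter.com', 'snarfed_org/172417043893731329', 2012)
parsed = parse_tag_uri_sketch(uri)
```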

parse_acct_uri(uri, hosts=None)[source]

Parses acct: URIs of the form acct:user@example.com .

Background: http://hueniverse.com/2009/08/making-the-case-for-a-new-acct-uri-scheme/

Parameters:

uri (str)
hosts (sequence of str) – allowed hosts, usually domains. None means allow all

Returns:

(username, host)

Return type:

(str, str) tuple

Raises: ValueError if the uri is invalid or the host isn’t allowed.
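
The parsing and host check might look roughly like this. parse_acct_uri_sketch is a hypothetical name and the error messages are illustrative only.

```python
def parse_acct_uri_sketch(uri, hosts=None):
    """Splits acct:user@host into (user, host); hypothetical stand-in."""
    scheme, sep, rest = uri.partition(':')
    if scheme != 'acct' or not sep:
        raise ValueError(f'Not an acct: URI: {uri}')

    user, at, host = rest.partition('@')
    if not (user and at and host):
        raise ValueError(f'Expected acct:user@host, got: {uri}')

    # hosts=None means every host is allowed
    if hosts is not None and host not in hosts:
        raise ValueError(f'Host {host} is not allowed')

    return user, host
```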

domain_from_link(url, minimize=True)[source]

Extracts and returns the meaningful domain from a URL.

Parameters:

url (string)
minimize (bool) – if true, strips www., mobile., and m. subdomains from the beginning of the domain

Returns:

domain, or None if url is None or blank

Return type:

str

domain_or_parent_in(input, domains)[source]

Returns True if an input domain or its parent is in a set of domains.

Examples:

foo, [] => False
foo, [foo] => True
foo.bar.com, [bar.com] => True
foobar.com, [bar.com] => False
foo.bar.com, [.bar.com] => True
foo.bar.com, [fux.bar.com] => False
bar.com, [fux.bar.com] => False

Parameters:

input (str) – domain or URL
domains (sequence of str) – domain

Returns:

bool
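
The examples above are consistent with checking the input and each of its parent domains against the set. Here is a minimal sketch of that approach; the name domain_or_parent_in_sketch is hypothetical, and unlike the documented function it handles bare domains only, not URLs.

```python
def domain_or_parent_in_sketch(domain, domains):
    """True if domain or any parent domain is in domains (bare domains only)."""
    # Leading dots, as in '.bar.com', are stripped before comparing
    candidates = {d.lower().strip('.') for d in domains}
    labels = domain.lower().strip('.').split('.')
    # Check the domain itself, then each parent: a.b.c, b.c, c
    return any('.'.join(labels[i:]) in candidates
               for i in range(len(labels)))
```

Comparing whole labels rather than string suffixes is what keeps foobar.com from matching bar.com.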

update_scheme(url, request)[source]

Returns a modified URL with the scheme upgraded to https if the request uses https.

Useful for converting URLs to https if and only if the current request itself is being served over https.

Parameters:

url (str)
request (flask.Request or webob.Request)

Returns:

URL

Return type:

str

schemeless(url, slashes=True)[source]

Strips the scheme (e.g. https:) from a URL.

Parameters:

url (str)
slashes (bool) – if False, also strips leading slashes and trailing slash, e.g. http://example.com/ becomes example.com

Returns:

URL

Return type:

str
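
A sketch of the behavior using only the standard library. schemeless_sketch is a hypothetical name, and this is one possible way to get the documented result, not necessarily how the library does it.

```python
from urllib.parse import urlparse, urlunparse

def schemeless_sketch(url, slashes=True):
    """Drops the scheme; optionally also the leading and trailing slashes."""
    url = urlunparse(urlparse(url)._replace(scheme=''))
    if not slashes:
        url = url.strip('/')
    return url

with_slashes = schemeless_sketch('http://example.com/')
without_slashes = schemeless_sketch('http://example.com/', slashes=False)
```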

fragmentless(url)[source]

Strips the fragment (e.g. ‘#foo’) from a URL.

Parameters:: url (str)
Returns:: URL
Return type:: str

clean_url(url)[source]

Removes transient query params (e.g. utm_*) from a URL.

The utm_* (Urchin Tracking Module) params come from Google Analytics. https://support.google.com/analytics/answer/1033867

The source=rss-... params are on all links in Medium’s RSS feeds.

Parameters:: url (str)
Returns:: the cleaned url, or None if it can’t be parsed
Return type:: str

quote_path(url)[source]

Quotes (URL-encodes) just the path part of a URL.

Parameters:: url (str)
Returns:: the quoted url, or None if it can’t be parsed
Return type:: str

base_url(url)[source]

Returns the base of a given URL.

For example, returns http://site/posts/ for http://site/posts/123.

Parameters:: url (str)

is_web(url)[source]: Returns True if the argument is an http or https URL, False otherwise.

is_url(url)[source]

Returns True if the argument is a URL, False otherwise.

Very dumb, just checks for scheme, host/netloc, and no whitespace.

extract_links(text)[source]

Returns a list of unique string URLs in the given text.

URLs in the returned list are in the order they first appear in the text.

tokenize_links(text, skip_bare_cc_tlds=False, skip_html_links=True, require_scheme=False)[source]

Splits text into link and non-link text.

Parameters:

text (str) – to linkify
skip_bare_cc_tlds (bool) – whether to skip links of the form [domain].[2-letter TLD] with no scheme and no path
skip_html_links (bool) – whether to skip links in HTML <a> tags (both href and text)
require_scheme (bool) – whether to require scheme (eg http://)

Returns:

list of links and list of non-link text. Roughly equivalent to the output of re.findall() and re.split(), with some post-processing.

Return type:

(sequence of str, sequence of str) tuple

linkify(text, pretty=False, skip_bare_cc_tlds=False, **kwargs)[source]

Adds HTML links to URLs in the given plain text.

For example: linkify('Hello http://tornadoweb.org!') would return Hello <a href="http://tornadoweb.org">http://tornadoweb.org</a>!

Ignores URLs that are inside HTML links, ie anchor tags that look like <a href="...">.

Parameters:

text (str) – input
pretty (bool) – if True, uses pretty_link() for link text
skip_bare_cc_tlds (bool) – whether to skip links of the form [domain].[2-letter TLD] with no scheme and no path

Returns:

linkified input

Return type:

str

pretty_link(url, text=None, text_prefix=None, keep_host=True, glyphicon=None, attrs=None, new_tab=False, max_length=None)[source]

Renders a pretty, short HTML link to a URL.

If text is not provided, the link text is the URL without the leading http(s)://[www.], ellipsized at the end if necessary. URL escape characters and UTF-8 are decoded.

The default maximum length follows Twitter’s rules: full domain plus 15 characters of path (including leading slash).

Parameters:

url (str)
text (str) – optional
text_prefix (str) – optional, added to beginning of text
keep_host (bool) – if False, remove the host from the link text
glyphicon (str) – glyphicon to render after the link text, if provided. Details: http://glyphicons.com/
attrs (dict) – attributes => values to include in the a tag. optional
new_tab (bool) – include target="_blank" if True
max_length (int) – max link text length in characters. ellipsized beyond this.

Returns:

HTML snippet with <a> tag

Return type:

str

parse_iso8601(val)[source]

Parses an ISO 8601 or RFC 3339 date/time string and returns a datetime.

Time zone designator is optional. If present, the returned datetime will be time zone aware.

Parameters:: val (str) – ISO 8601 or RFC 3339, e.g. 2012-07-23T05:54:49+00:00
Returns:: datetime.datetime

parse_iso8601_duration(input)[source]

Parses an ISO 8601 duration.

Note: converts months to 30 days each. ISO 8601 doesn’t seem to define the number of days in a month. Background: https://stackoverflow.com/a/29458514/186123

Parameters:: input (str) – ISO 8601 duration, e.g. P3Y6M4DT12H30M5S

https://en.wikipedia.org/wiki/ISO_8601#Durations

Returns:: the parsed duration, or None if input cannot be parsed as an ISO 8601 duration
Return type:: datetime.timedelta

to_iso8601_duration(input)[source]

Converts a timedelta to an ISO 8601 duration.

Returns a fairly strict format: PnDTnS. Fractional seconds are silently dropped.

Parameters:: input (datetime.timedelta)

https://en.wikipedia.org/wiki/ISO_8601#Durations

Returns:: ISO 8601 duration, e.g. P3DT4S
Return type:: str

Raises: TypeError if input is not a datetime.timedelta
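
A minimal sketch of the days-and-seconds form suggested by the P3DT4S example. to_iso8601_duration_sketch is a hypothetical name, and negative durations are not considered here.

```python
from datetime import timedelta

def to_iso8601_duration_sketch(delta):
    """Formats a timedelta as days plus seconds; hypothetical stand-in."""
    if not isinstance(delta, timedelta):
        raise TypeError(f'Expected timedelta, got {delta.__class__.__name__}')
    # timedelta normalizes to days, seconds, and microseconds;
    # microseconds are dropped here
    return f'P{delta.days}DT{delta.seconds}S'

duration = to_iso8601_duration_sketch(timedelta(days=3, seconds=4))
```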

maybe_iso8601_to_rfc3339(input)[source]

Tries to convert an ISO 8601 date/time string to RFC 3339.

The formats are similar, but not identical, e.g. RFC 3339 includes a colon in the timezone offset at the end (+00:00 instead of +0000), but ISO 8601 doesn’t.

If the input can’t be parsed as ISO 8601, it’s silently returned, unchanged!

http://www.rfc-editor.org/rfc/rfc3339.txt

maybe_timestamp_to_rfc3339(input)[source]

Tries to convert a string or int UNIX timestamp to RFC 3339.

Assumes UNIX timestamps are always UTC. (They’re generally supposed to be.)

maybe_timestamp_to_iso8601(input)[source]

Tries to convert a string or int UNIX timestamp to ISO 8601.

Assumes UNIX timestamps are always UTC. (They’re generally supposed to be.)

to_utc_timestamp(input)[source]: Converts a datetime to a float POSIX timestamp (seconds since epoch).

as_utc(input)[source]

Converts a timezone-aware datetime to a naive UTC datetime.

If input is timezone-naive, it’s returned as is.

Doesn’t support DST!

naturaltime(val, when=None, **kwargs)[source]

Wrapper for humanize.naturaltime that handles timezone-aware datetimes.

…since humanize currently doesn’t. :( https://github.com/python-humanize/humanize/issues/17

ellipsize(str, words=14, chars=140)[source]

Truncates and ellipsizes str if it’s longer than words or chars.

Words are simply tokenized on whitespace, nothing smart.

add_query_params(url, params)[source]

Adds new query parameters to a URL. Encodes as UTF-8 and URL-safe.

Parameters:

url (str) – URL or urllib.request.Request. May already have query parameters.
params (dict or list of (str key, str value) tuples) – Keys may repeat.

Returns:

URL

Return type:

str
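
With the standard library, appending query parameters can be sketched as below. add_query_params_sketch is a hypothetical name, and this version accepts only str URLs, not urllib.request.Request objects.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def add_query_params_sketch(url, params):
    """Appends params to url's existing query string (str URLs only)."""
    if isinstance(params, dict):
        params = list(params.items())
    parsed = urlparse(url)
    # Keep existing params, including blank ones, then append the new ones
    query = parse_qsl(parsed.query, keep_blank_values=True) + list(params)
    return urlunparse(parsed._replace(query=urlencode(query)))

added = add_query_params_sketch('http://a.com/p?x=1', {'y': 'b c'})
```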

remove_query_param(url, param)[source]

Removes query parameter(s) from a URL. Decodes URL escapes and UTF-8.

If the query parameter is not present in the URL, the URL is returned unchanged, and the returned value is None.

If the query parameter is present multiple times, only the last value is returned.

Parameters:

url (str) – URL
param (str) – name of query parameter to remove

Returns:

(URL without the given param, param value)

Return type:

(str, str) tuple
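
The (URL, value) return shape can be sketched like this. remove_query_param_sketch is a hypothetical name. Note that this version re-encodes the remaining query string, so unusual escaping in the input may not survive byte for byte.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def remove_query_param_sketch(url, param):
    """Returns (url without param, last value of param or None)."""
    parsed = urlparse(url)
    removed = None
    rest = []
    for name, val in parse_qsl(parsed.query, keep_blank_values=True):
        if name == param:
            removed = val  # later occurrences overwrite earlier ones
        else:
            rest.append((name, val))
    return urlunparse(parsed._replace(query=urlencode(rest))), removed

stripped = remove_query_param_sketch('http://a.com/?x=1&y=2&x=3', 'x')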

dedupe_urls(urls, key=None, trailing_slash=True)[source]

Normalizes and de-dupes http(s) URLs.

Converts domain to lower case, optionally adds trailing slash when path is empty, and ignores scheme (http vs https), preferring https. Preserves order. Removes Nones and blank strings.

Domains are case insensitive, even modern domains with Unicode/punycode characters:

As examples, http://foo/ and https://FOO are considered duplicates, but http://foo/bar and http://foo/bar/ aren’t.

Background: https://en.wikipedia.org/wiki/URL_normalization

Parameters:

urls (sequence of str) – URLs or dict objects with url keys
key (str) – optional, inner key to be dereferenced in a dict object before looking for the url key
trailing_slash (bool) – whether to add trailing slash if it’s missing

Returns:

URLs

Return type:

sequence of str

encode_oauth_state(obj)[source]

The state parameter is passed to various source authorization endpoints and returned in a callback. This encodes a JSON object so that it can be safely included as a query string parameter.

Parameters:: obj (dict) – JSON-serializable
Returns:: str

decode_oauth_state(state)[source]

Decodes a state parameter encoded by encode_oauth_state().

Parameters:: state (str) – JSON-serialized dict, or None
Returns:: dict
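
One way to make a JSON object safe for a query string is to serialize it and percent-encode the result. The sketch below assumes that encoding; the function names are hypothetical, and so is the choice to return an empty dict for a missing state.

```python
import json
import urllib.parse

def encode_state_sketch(obj):
    """Serializes obj to JSON, then percent-encodes it (assumed encoding)."""
    return urllib.parse.quote(json.dumps(obj, sort_keys=True))

def decode_state_sketch(state):
    """Inverse of encode_state_sketch(); assumes None or '' means no state."""
    if not state:
        return {}
    return json.loads(urllib.parse.unquote(state))

state = encode_state_sketch({'operation': 'delete', 'id': 'x y'})
round_tripped = decode_state_sketch(state)
```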

if_changed(cache, updates, key, value)[source]

Returns a value if it’s different from the cached value, otherwise None.

Values that evaluate to False are considered equivalent to None, in order to save cache space.

If the values differ, updates[key] is set to value. You can use this to collect changes that should be made to the cache in batch. None values in updates mean that the corresponding key should be deleted.

Parameters:

cache – any object with a get(key) method
updates (dict)
key – anything supported by cache
value – anything supported by cache

Returns:

value or None
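
The batching pattern described above can be sketched as follows. if_changed_sketch is a hypothetical name, and the plain dict stands in for any cache object with a get(key) method.

```python
def if_changed_sketch(cache, updates, key, value):
    """Returns value and records it in updates if it differs from the cache."""
    cached = cache.get(key)
    # Falsy values ('' or 0 or None) count as equivalent to each other
    if cached == value or (not cached and not value):
        return None
    updates[key] = value
    return value

cache = {'a': 1}
updates = {}
unchanged = if_changed_sketch(cache, updates, 'a', 1)
changed = if_changed_sketch(cache, updates, 'a', 2)
blank = if_changed_sketch(cache, updates, 'b', '')
```

After these calls, updates holds only the changed key, so the cache can be written once in batch.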

generate_secret()[source]

Generates a URL-safe random secret string.

Uses App Engine’s os.urandom(), which is designed to be cryptographically secure: http://code.google.com/p/googleappengine/issues/detail?id=1055

Parameters:: bytes (int) – length of string to generate
Returns:: str

is_int(arg)[source]: Returns True if arg can be converted to an integer, False otherwise.

is_float(arg)[source]: Returns True if arg can be converted to a float, False otherwise.

is_base64(arg)[source]: Returns True if arg is a base64 encoded string, False otherwise.

sniff_json_or_form_encoded(value)[source]

Detects whether value is JSON or form-encoded, parses and returns it.

Parameters:: value (str)
Returns:: dict if form-encoded, dict or list if JSON, otherwise str
Return type:: dict or list or str

interpret_http_exception(exception)[source]

Extracts the status code and response from different HTTP exception types.

Parameters:

exception (Exception) –

an HTTP request exception. Supported types:

apiclient.errors.HttpError
webob.exc.WSGIHTTPException
gdata.client.RequestError
oauth2client.client.AccessTokenRefreshError
requests.HTTPError
urllib.error.HTTPError
urllib.error.URLError
werkzeug.exceptions.HTTPException

Returns:

(str status code or None, str response body or None)

is_connection_failure(exception)[source]

Returns True if the given exception is a network connection failure.

…False otherwise.

class FileLimiter(file_obj, read_limit)[source]

Bases: object

A file object wrapper that reads up to a limit and then reports EOF.

From http://stackoverflow.com/a/29838711/186123 . Thanks SO!
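
The wrapper idea, sketched independently of the linked answer. FileLimiterSketch is a hypothetical name, and only read() is covered here.

```python
import io

class FileLimiterSketch:
    """Reads at most read_limit bytes or chars, then reports EOF."""

    def __init__(self, file_obj, read_limit):
        self.file_obj = file_obj
        self.remaining = read_limit

    def read(self, size=-1):
        if self.remaining <= 0:
            # read(0) returns an empty value of the right type (bytes or str)
            return self.file_obj.read(0)
        if size < 0 or size > self.remaining:
            size = self.remaining
        data = self.file_obj.read(size)
        self.remaining -= len(data)
        return data

limited = FileLimiterSketch(io.BytesIO(b'abcdefgh'), 5)
first = limited.read(3)
second = limited.read()
third = limited.read()
```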

read(filename)[source]: Returns the contents of filename, or None if it doesn’t exist.

load_file_lines(file)[source]

Reads lines from a file and returns them as a set.

Leading and trailing whitespace is trimmed. Blank lines and lines beginning with # (ie comments) are ignored.

Parameters:: file (str or file) – either a string filename or a file object or other iterable that returns lines
Return type:: set of str
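
A minimal sketch of the filtering rules above. load_file_lines_sketch is a hypothetical name, and the example passes a list of strings as the iterable of lines.

```python
def load_file_lines_sketch(file):
    """Returns the set of non-blank, non-comment lines, stripped."""
    if isinstance(file, str):
        # A string is treated as a filename
        with open(file) as f:
            return load_file_lines_sketch(f)
    stripped = (line.strip() for line in file)
    return {line for line in stripped if line and not line.startswith('#')}

lines = load_file_lines_sketch(['foo\n', '  # a comment\n', '\n', ' bar ', 'foo\n'])
```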

json_loads(*args, **kwargs)[source]: Wrapper around json.loads() that centralizes our JSON handling.

json_dumps(*args, **kwargs)[source]: Wrapper around json.dumps() that centralizes our JSON handling.

set_user_agent(val)[source]

Sets the user agent to be sent in urlopen() and requests_fn().

Parameters:: val (str)

urlopen(url_or_req, *args, **kwargs)[source]

Wraps urllib.request.urlopen() and logs the HTTP method and URL.

Use set_user_agent() to change the User-Agent header to be sent.

requests_fn(fn)[source]

Wraps requests.* and logs the HTTP method and URL.

Use set_user_agent() to change the User-Agent header to be sent.

Parameters:

fn (str) – ‘head’, ‘get’, or ‘post’

Returns:

drop-in replacement for requests.get() etc

The gateway kwarg is a bool for whether this is in an HTTP gateway request handler context. If True, errors will be raised as appropriate Flask HTTP exceptions. Malformed URLs result in werkzeug.exceptions.BadRequest (HTTP 400); connection failures and HTTP 4xx and 5xx result in werkzeug.exceptions.BadGateway (HTTP 502).

Return type:

callable, (str url, gateway=None, **kwargs) => requests.Response

requests_post_with_redirects(url, *args, **kwargs)[source]

Make an HTTP POST, and follow redirects with POST instead of GET.

Violates the HTTP spec’s rule to follow POST redirects with GET. Yolo!

Parameters:: url (str)
Returns:: requests.Response
Raises:: TooManyRedirects

follow_redirects(url, **kwargs)[source]

Fetches a URL with HEAD, repeating if necessary to follow redirects.

Caches results for 1 day by default. To bypass the cache, use follow_redirects.__wrapped__(…).

Does not raise an exception if any of the HTTP requests fail, just returns the failed response. If you care, be sure to check the returned response’s status code!

Parameters:

url (str)
kwargs – passed to requests.head()

Returns:

The response from the final request. Its url attribute has the final URL.

Return type:

requests.Response

class UrlCanonicalizer(scheme='https', domain=None, subdomain=None, approve=None, reject=None, query=False, fragment=False, trailing_slash=False, redirects=True, headers=None)[source]

Bases: object

Converts URLs to their canonical form.

If an input URL matches approve or reject, it’s automatically approved or rejected, respectively, as is without following redirects.

If we HEAD the URL to follow redirects and it returns 4xx or 5xx, we return None.
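
As a rough illustration of what canonicalizing a URL can involve, here is a sketch built on urllib.parse. It is not the class's implementation: canonicalize is a hypothetical function, and the meaning given to each flag (query and fragment as "keep this part", trailing_slash as "path should end in /") is an assumption based only on the constructor's argument names. It skips the domain checks, approve/reject patterns, and redirect following.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url, scheme='https', query=False, fragment=False,
                 trailing_slash=False):
    """Hypothetical sketch; flag meanings are assumed, not the library's."""
    parts = urlsplit(url)
    path = parts.path or '/'
    if trailing_slash and not path.endswith('/'):
        path += '/'
    elif not trailing_slash and path != '/':
        path = path.rstrip('/')
    return urlunsplit((
        scheme,                               # force a single scheme
        parts.netloc.lower(),                 # hostnames are case-insensitive
        path,
        parts.query if query else '',         # drop the query unless asked to keep it
        parts.fragment if fragment else '',   # likewise for the fragment
    ))
```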

class WideUnicode(*args, **kwargs)[source]

Bases: str

String class with consistent indexing and len() on narrow and wide Python.

PEP 261 describes how Python 2 builds come in “narrow” and “wide” flavors. Wide is configured with --enable-unicode=ucs4, which represents Unicode high code points above the 16-bit Basic Multilingual Plane in unicode strings as single characters. This means that len(), indexing, and slices of unicode strings use Unicode code points consistently.

Narrow, on the other hand, represents high code points as “surrogate pairs” of 16-bit characters. This means that len(), indexing, and slices of unicode strings do not always correspond to Unicode code points.

Mac OS X, Windows, and older Linux distributions have narrow Python 2 builds, while many modern Linux distributions have wide builds, so this can cause platform-specific bugs, e.g. with many commonly used emoji.

Inspired by: http://stackoverflow.com/a/9934913
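
The difference between the two flavors can be reproduced on Python 3, which always indexes by code point, by counting UTF-16 code units by hand. This is only an illustration of the discrepancy, not of how WideUnicode works internally.

```python
s = 'a\U0001F600b'   # 'a', an emoji above the Basic Multilingual Plane, 'b'

# What a wide build (and Python 3) counts: one unit per code point.
code_points = len(s)

# What a narrow build counts: one unit per 16-bit UTF-16 code unit,
# so the emoji contributes a surrogate pair.
utf16_units = len(s.encode('utf-16-le')) // 2
```

Here code_points is 3 and utf16_units is 4, so index 2 refers to 'b' on a wide build and to the second half of the emoji's surrogate pair on a narrow one.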

remove_invisible_chars(val)[source]

Removes invisible/non-printable Unicode control characters from a string.

Parameters: val (str)
Returns: val with invisible characters removed
Return type: str
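
One common way to do this kind of filtering, shown here as a sketch and not as the library's implementation, is to drop every code point whose Unicode general category starts with C (control, format, surrogate, private use, unassigned). strip_control_chars is a hypothetical name.

```python
import unicodedata


def strip_control_chars(val):
    """Hypothetical sketch: drops code points in the Unicode C* categories.

    Note that this also removes tabs and newlines, which are category Cc.
    """
    return ''.join(c for c in val if unicodedata.category(c)[0] != 'C')
```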

parse_html(input, **kwargs)[source]

Parses an HTML string with BeautifulSoup.

Uses the HTML parser currently set in the beautifulsoup_parser global. http://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use

We generally try to use the same parser and version in prod and locally, since we’ve been bitten by at least one meaningful difference between lxml and e.g. html5lib: lxml includes the contents of <noscript> tags, while html5lib omits them. Details: https://github.com/snarfed/bridgy/issues/798#issuecomment-370508015

Also lxml is noticeably faster than the others.

Specifically, projects like oauth-dropins, granary, and bridgy all use lxml explicitly.

Parameters:

input (str or requests.Response) – input HTML
kwargs – passed through to bs4.BeautifulSoup constructor

Returns:

bs4.BeautifulSoup

parse_mf2(input, url=None, id=None, metaformats=None)[source]

Parses microformats2 out of HTML.

Currently uses mf2py.

Parameters:

input (str, bs4.BeautifulSoup, or requests.Response)
url (str) – optional, URL of the input page, used as the base for relative URLs
id (str) – optional id of specific element to extract and parse. defaults to the whole page.
metaformats (bool) – if True, extract and include data from metaformats, https://microformats.org/wiki/metaformats , as well as from mf2. The generated item will be h-card for home pages (ie URL path /), h-entry otherwise.

Returns:

parsed mf2 data, or {} if the HTML can’t be parsed, eg if mf2py raises RecursionError (https://github.com/microformats/mf2py/issues/78), or None if id is provided and not found in the input HTML

Return type:

dict

parse_metaformats(soup, url, type='h-card')[source]

Converts metadata in an HTML page to a microformats2 item.

Approximately implements the metaformats standard, https://microformats.org/wiki/metaformats , and includes extras like <title>, <meta description>, and <link rel=icon>.

More background: https://github.com/microformats/mf2py/pull/213

Parameters:

soup (bs4.BeautifulSoup) – parsed input HTML page
url (str) – optional, URL of the input page, used as the base for relative URLs
type (str) – optional, type of the returned mf2 item

Returns:

parsed mf2 item, or None if no metadata is available

Return type:

dict

parse_http_equiv(content)[source]

Parses the value in the http_equiv meta field and returns the url.

Parameters: content (str) – http_equiv content string, see https://www.w3.org/TR/WCAG20-TECHS/H76.html#procedure
Returns: the URL, or an empty string if the content format is incorrect
Return type: str
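
For reference, a meta refresh content value looks like "0; url=https://example.com/". A minimal sketch of parsing it follows; refresh_target and its regular expression are hypothetical, and the real function may accept or reject different inputs.

```python
import re

# Matches eg "0; url=https://example.com/" or "5;URL='https://example.com/'".
_REFRESH_RE = re.compile(
    r"""^\s*\d+\s*;\s*url\s*=\s*['"]?([^'"]+?)['"]?\s*$""", re.IGNORECASE)


def refresh_target(content):
    """Hypothetical sketch: returns the URL in a meta refresh value, or ''."""
    match = _REFRESH_RE.match(content or '')
    return match.group(1) if match else ''
```

A value with only a delay, such as "30", reloads the same page and has no URL, so the sketch returns an empty string for it.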

fetch_http_equiv(input, **kwargs)[source]

Fetches http_equiv meta tag, if available.

Parameters: input (str, bs4.BeautifulSoup, or requests.Response)
Returns: the URL if available, otherwise an empty string
Return type: str

fetch_mf2(url, get_fn=<function requests_fn.<locals>.call>, gateway=False, require_backlink=None, metaformats=False, **kwargs)[source]

Fetches an HTML page over HTTP, parses it, and returns its microformats2.

If url includes a fragment, or redirects to a URL with a fragment, only that element of the HTML will be parsed and returned.

Parameters:

url (str)
get_fn (callable) – matching requests.get()’s signature, for the HTTP fetch
gateway (bool) – see requests_fn()
require_backlink (str or sequence of strs) – If provided, one of these must be in the response body, in any form. Generally used for webmention validation.
metaformats (bool) – passed through to parse_mf2()
kwargs – passed through to requests.get()

Returns:

parsed mf2 data. Includes the final URL of the parsed document (after redirects) in the top-level url field. If the url doesn’t return HTML or can’t be parsed for microformats2, returns None.

Return type:

dict

Raises:

ValueError – if a backlink in require_backlink is not found
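
The require_backlink check can be pictured with the sketch below. check_backlink is a hypothetical helper, and it only looks for each link verbatim in the body; the "in any form" matching described above is presumably looser than that.

```python
def check_backlink(body, require_backlink):
    """Hypothetical sketch of a backlink check for webmention validation.

    Raises ValueError if none of the required links appear in body.
    """
    if not require_backlink:
        return
    # Accept either a single string or a sequence of strings.
    if isinstance(require_backlink, str):
        links = [require_backlink]
    else:
        links = list(require_backlink)
    if not any(link in body for link in links):
        raise ValueError(f'none of {links} found in the response body')
```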

send_email(*, smtp_host=None, smtp_port=None, from_=None, to=None, subject=None, body=None)[source]

Sends an email via a given SMTP server.

If smtp_user and smtp_password files exist in the current directory, they’re used to log into the SMTP server.

Parameters:

smtp_host (str)
smtp_port (str, optional)
from_ (str)
to (str or list)
subject (str)
body (str)
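
For orientation, sending mail with the standard library looks roughly like the sketch below. build_message and send are hypothetical helpers, not this function's implementation, and the default port 587 with STARTTLS is an assumption.

```python
import smtplib
from email.message import EmailMessage


def build_message(from_, to, subject, body):
    """Hypothetical helper: builds the message that would be handed to SMTP."""
    msg = EmailMessage()
    msg['From'] = from_
    # to may be a single address or a list of addresses.
    msg['To'] = ', '.join(to) if isinstance(to, (list, tuple)) else to
    msg['Subject'] = subject
    msg.set_content(body)
    return msg


def send(msg, smtp_host, smtp_port=587, user=None, password=None):
    """Hypothetical sender; port 587 with STARTTLS is an assumption."""
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.starttls()
        if user and password:
            smtp.login(user, password)
        smtp.send_message(msg)
```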

d(*objs)[source]: Pretty-prints an object as JSON, for debugging.

webmention 

Webmention endpoint discovery and sending.

Spec: https://webmention.net/draft/

class Endpoint(endpoint, response)

Bases: tuple

Returned by discover().

endpoint

Type: str

response

Type: requests.Response

discover(url, follow_meta_refresh=False, **requests_kwargs)[source]

Discovers a URL’s webmention endpoint.

Follows up to 30 HTTP 3xx redirects, and at most one client-side HTML meta http-equiv=refresh redirect.

Parameters:

url (str)
follow_meta_refresh (bool) – whether to follow client side redirects in HTML meta http-equiv=refresh tags
requests_kwargs – passed to requests.get()

Returns:

If no endpoint is discovered, the endpoint attribute will be None.

Return type:

Endpoint

Raises:

ValueError – on bad URL
HTTPError – on failure
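
Per the Webmention spec, discovery checks the HTTP Link header for rel="webmention" before looking at link and a elements in the HTML. The sketch below covers only the header step. endpoint_from_link_header is a hypothetical function, and its parsing is simplified: it does not handle commas inside quoted parameter values.

```python
import re
from urllib.parse import urljoin

# One "<target>; params" entry of a Link header; params run up to the next comma.
_LINK_RE = re.compile(r'<([^>]*)>\s*;([^,]*)')
_REL_RE = re.compile(r'\brel\s*=\s*"?([^";]+)"?', re.IGNORECASE)


def endpoint_from_link_header(header, base_url):
    """Hypothetical sketch: finds rel=webmention in a Link header value."""
    for target, params in _LINK_RE.findall(header or ''):
        rel = _REL_RE.search(params)
        # rel may hold several space-separated values.
        if rel and 'webmention' in rel.group(1).lower().split():
            return urljoin(base_url, target)   # the endpoint may be relative
    return None
```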

send(endpoint, source, target, **requests_kwargs)[source]

Sends a webmention.

Parameters:

endpoint (str) – webmention endpoint URL
source (str) – source URL
target (str) – target URL
requests_kwargs – passed to requests.post()

Returns:

the endpoint’s response, on success

Return type:

requests.Response

Raises:

ValueError – on bad URL
HTTPError – on failure
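
On the wire, a webmention is a form-encoded POST to the endpoint with source and target parameters. A minimal sketch of building that body follows; webmention_body is a hypothetical helper, and its URL check (http or https scheme plus a host) is an assumption about what counts as a bad URL.

```python
from urllib.parse import urlencode, urlparse


def webmention_body(source, target):
    """Hypothetical helper: builds the form-encoded body of a webmention POST."""
    for name, url in (('source', source), ('target', target)):
        parsed = urlparse(url)
        # Assumption: a "bad URL" is one without an http(s) scheme or a host.
        if parsed.scheme not in ('http', 'https') or not parsed.netloc:
            raise ValueError(f'bad {name} URL: {url!r}')
    return urlencode({'source': source, 'target': target})
```

The result would be sent with Content-Type application/x-www-form-urlencoded, which is what requests.post() uses when given a dict or string via its data argument.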