Quickly adding redirects to wagtail

Recently I needed to add a large number of redirects to wagtail according to a specific pattern, because of bad links that were created owing to some bad parsing of /feed/ URLs. Redirects in wagtail use the wagtail.contrib.redirects.models.Redirect model, and this is a quick and dirty script to automatically add them.

Full documentation for redirects can be found here.

from wagtail.contrib.redirects.models import Redirect 
urls_txt = """https://www.mycanadapayday.com/blog/4-habits-of-millionaires-canadians-should-knowfeed/
https://www.mycanadapayday.com/blog/5-creative-ways-to-save-during-holiday-travelfeed/
https://www.mycanadapayday.com/blog/5-everyday-ways-to-save-when-you-buy-online-and-in-real-timefeed/
https://www.mycanadapayday.com/blog/7-new-ways-to-pay-off-your-debt-fasterfeed/
https://www.mycanadapayday.com/blog/7-ways-you-can-rent-even-with-bad-creditfeed/
https://www.mycanadapayday.com/blog/you-can-celebrate-canada-day-on-a-family-friendly-budgetfeed/""" 
def main():
    urls = urls_txt.split('\n')
    for url in urls:
        if 'feed' not in url:
            continue
        newurl = re.sub('feed/$', '', url)
        print('redir', url, 'to', newurl)
        old_path = re.sub(r'https://www.mycanadapayday.com', '', url)
        old_path = re.sub(r'/$', '', old_path)
        existing = Redirect.objects.filter(old_path=old_path).first()
        if existing:
            print('already done', url)
        else:
            print('making new redirect for', old_path, 'to', newurl)
            redirect = Redirect(
                old_path=old_path,
                is_permanent=True,
                redirect_link=newurl,
            )
            redirect.save()

And as you can see, the old link is now redirected to this correct page on saving money.

Programmatically adding images to Wagtail

Here’s just a little code snippet to download an image using requests and upload that same image into Wagtail where it can be used in templates.

import requests
from wagtail.images.models import Image
from io import BytesIO
from django.core.files.images import ImageFile


http_res = requests.get(image_url)
title = 'my_image_file.jpg'
image_file = ImageFile(BytesIO(http_res.content), name=title)
image = Image(title=title, image_file)
image.save()

In order to insert images into a RichTextField, you’ll need to convert them into an embed. If you’ve got existing HTML with image tags, you can do this with BeautifulSoup. Assuming you’ve just kept the same file paths as the title:

from bs4 import BeautifulSoup

def replace_images_with_embeds(body):
    soup = BeautifulSoup(body, 'lxml')
    for img in soup.find_all("img"):
        print('found image:', img['src'])
        filename = re.sub(r'http://.+?/', '', img['src'])
        image = Image.objects.get(title=filename)
        embed = soup.new_tag('embed')
        embed['format'] = 'fullwidth'
        embed['id'] = image.pk
        embed['embedtype'] = 'image'
        embed['alt'] = filename
        img.parent.insert_after(embed)
        img.decompose() # remove the unused img

Taking full size screenshots using Selenium 3 with the Firefox geckodriver

Since Selenium3 switched from FirefoxDriver to geckodriver, which is now mandatory as of Firefox 47, the screenshot command now only takes a picture of the viewport.  As far as I’ve been able to tell, the only solution anyone has come up with is using something like aShot to scroll down the page and stitch a string of images together.  As it turns out though, the geckodriver system is intentionally crippled to comply with the JsonWire protocol which specifies that it should take a picture of the viewport.  Firefox itself still has the same capabilities that it did before.  To work around this, you can build a custom geckodriver.

First you need a dev environment that will build geckodriver, so use a Dockerfile containing:

FROM selenium/standalone-firefox:3.8.1
USER root
RUN apt-get update && apt-get install -y build-essential file curl
RUN apt-get install -y less python
ENV SHELL=/bin/bash
RUN apt-get install -y less python
RUN apt-get install -y autoconf2.13 build-essential ccache python-dev python-pip python-setuptools unzip uuid zip
RUN apt-get install -y libasound2-dev libcurl4-openssl-dev libdbus-1-dev libdbus-glib-1-dev libgconf2-dev libgtk-3-dev libgtk2.0-dev libiw-dev libnotify-dev libpulse-dev libx11-xcb-dev libxt-dev mesa-com
RUN pip install mercurial
ENV SSL_VERSION=1.0.2n


RUN curl https://www.openssl.org/source/openssl-$SSL_VERSION.tar.gz -O && \
 tar -xzf openssl-$SSL_VERSION.tar.gz && \
 cd openssl-$SSL_VERSION && ./config && make depend && make install && \
 cd .. && rm -rf openssl-$SSL_VERSION*

ENV OPENSSL_LIB_DIR=/usr/local/ssl/lib \
 OPENSSL_INCLUDE_DIR=/usr/local/ssl/include \
 OPENSSL_STATIC=1

# install toolchain
RUN curl https://sh.rustup.rs -sSf | \
 sh -s -- --default-toolchain stable -y
ENV PATH=/root/.cargo/bin:$PATH
USER seluser

Then you want to run that docker image with a local volume mapped, because the firefox source is huge.

docker run -u root -it -v /home/somewhere/firefox-src:/src geckodriver bash

then follow the instructions at https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Instructions/Simple_Firefox_build/Linux_and_MacOS_build_preparation

Choose regular desktop firefox, when bootstrap.py asks you, not artifact.

wget https://hg.mozilla.org/mozilla-central/raw-file/default/python/mozboot/bin/bootstrap.py
python bootstrap.py
hg clone https://hg.mozilla.org/mozilla-central

Then hand apply this simple patch, which may need some changes depending on what they do with the code:

diff -r 4db166f0442d testing/geckodriver/src/marionette.rs
--- a/testing/geckodriver/src/marionette.rs Thu Jan 11 00:01:16 2018 +0200
+++ b/testing/geckodriver/src/marionette.rs Thu Jan 11 03:04:49 2018 +0000
@@ -1147,14 +1147,14 @@
 let mut data = BTreeMap::new();
 data.insert("id".to_string(), Json::Null);
 data.insert("highlights".to_string(), Json::Array(vec![]));
- data.insert("full".to_string(), Json::Boolean(false));
+ data.insert("full".to_string(), Json::Boolean(true));
 (Some("takeScreenshot"), Some(Ok(data)))
 },
 TakeElementScreenshot(ref e) => {
 let mut data = BTreeMap::new();
 data.insert("id".to_string(), e.id.to_json());
 data.insert("highlights".to_string(), Json::Array(vec![]));
- data.insert("full".to_string(), Json::Boolean(false));
+ data.insert("full".to_string(), Json::Boolean(true));
 (Some("takeScreenshot"), Some(Ok(data)))
 },
 Extension(ref extension) => {

put this in a file called ‘mozconfig’ to tell mach to build geckodriver, as it’s not done by default.

ac_add_options –enable-geckodriver

cd mozilla-central
echo "ac_add_options --enable-geckodriver" > mozconfig
./mach build

This results in the binary being located in ./obj-x86_64-pc-linux-gnu/dist/bin/geckodriver.  If you want to use the official selenium docker image, you’ll need to copy that over top of  /usr/bin/geckodriver — otherwise, just replace it whereever it is on your system.

AWS Rekognition: Unable to get image metadata from S3. Check object key, region and/or access permissions.

What does this Amazon Rekognition error mean: Unable to get image metadata from S3.  Check object key, region and/or access permissions.

For me, this meant that I was calling

DetectFaces(Image => { S3Object => { Bucket => “bucket_name”, Name => “path_to_image” } }

with bucket_name belonging to an S3 bucket in a different region than the Rekognition endpoint that I was using.  It turns out, Rekognition can only access buckets in its region.  Copying the data to us-west-2 resolved the problem.

Sending SMS messages with Amazon SNS and Paws

Paws is a great way to access AWS services in Perl.  Here’s how you can send an SMS:

use Paws;
my $paws = Paws->service('SNS');
$paws->Publish(
   Message => "test",
   PhoneNumber => "+1XXXXXXXXXX",
);

If you get an error message saying

Invalid parameter: PhoneNumber Reason: +1XXXXXXXXXX is not valid to publish to

This may just mean that you aren’t connecting to the right endpoint, as SMS sending through SNS is only available in 6 regions at the time of this writing.  If your default region isn’t one of those, you can specify your region for this Paws instance like so:

use Paws;
my $paws = Paws->service('SNS', region => 'us-west-2');
$paws->Publish(
   Message => "test",
   PhoneNumber => "+1XXXXXXXXXX",
);

Deduplicating SQS messages using DynamoDB in Perl

Amazon SQS is a massively scalable, highly available message queuing system.  It is capable of handling virtually unlimited amounts of traffic, but it has one particularly awkward issue: duplicate messages.  They are relatively infrequent, but they can have serious consequences, ie. debiting a bank account a second time.

SQS does provide a solution to this in the form of FIFO queues, but they have some drawbacks.  First, they are limited to about 300 messages/second (which may not be an issue for everyone), but more importantly they simply aren’t available in many regions.  At the time of this writing they are only operational in US West and US East.

If you have higher performance requirements or you aren’t lucky enough to be in a region that supports it, you still have options.  One solution is to architect your code such that the operations are idempotent, meaning they can execute multiple times without causing problems.  If the task is doing something like refreshing a cached webpage, you might not have to even consider idempotency as an infrequent duplicate refresh of a web page may be inconsequential.

On the other hand, sometimes operations absolutely must never happen twice, or they can be so computationally expensive that avoiding them is critical.  One of the easiest ways to achieve this is to simply assign a GUID to each message and deduplicate them.  The solution I have used is to combine DynamoDB and SQS.  DynamoDB is also highly scalable, with default capacity limits per table of 40,000 reads and writes per second.

To do this in Perl, I’m using Paws.  For clarity, I’m skipping error handling and retry logic.

First, we have to create a table in DynamoDB.  My structure is simple, a key string column named job_id and an integer column called taken.

store a record in DynamoDB containing the GUID, and an attribute indicating that the job hasn’t been taken.

use Data::UUID;
use Paws;
use JSON::XS;
# store the job id in DynamoDB
my $job_id = Data::UUID->new->create_str;
my $dynamodb = Paws->service('DynamoDB', region => 'us-west-2');
$dynamodb->PutItem(
   TableName => "sqs_job",
   Item => {
      job_id => {
         S => $job_id,
      },
      taken => {
         N => 0,
      },
   job_result => { S => " " },
});

# send the message
my $sqs = Paws->service('SQS', region => 'us-west-2');
my $queue_name = "my-queue-name";
my $queue_url = $sqs->GetQueueUrl(QueueName => $queue_name)->QueueUrl;
$sqs->SendMessage(MessageBody => encode_json({ job_id => $job_id }), QueueUrl => $queue_url);

In the receiver we use a conditional update expression to set taken=1 if and only if taken=0.  Otherwise, we exit and avoid duplicate processing.

my $recv = $sqs->ReceiveMessage(QueueUrl => $queue_url, MaxNumberOfMessages => 1, WaitTimeSeconds => $timeout);
foreach my $raw_message (@{$recv->Messages}) {
   my $message = decode_json($raw_message->{Body});
# ensure that SQS doesn't resend it by deleting it
   $sqs->DeleteMessage(QueueUrl => $queue_url, ReceiptHandle =>  $raw_message->{ReceiptHandle});
   # get the job_id from the payload (this could be done with a message attribute instead)
   my $job_id = $message->{job_id};
   eval {
      my $res = Screencap::DynamoDB->Get->UpdateItem(
         TableName => "sqs_job",
         Key => {
            job_id => {
               S => $job_id,
            },
         },
         ConditionExpression => "taken = :zero", # only update if taken=0
         UpdateExpression => "SET taken = :one", # set taken=1
         ExpressionAttributeValues => {
            ":zero" => {
               N => 0,
            },
            ":one" => {
               N => 1,
            },
         },
      );
   }
   if ($@ =~ /^The conditional request failed/) { # this is the error message when a conditional expression fails, if it does we skip processing because we've already handled it
      next;
   }
   # handle the message
}

And that’s it.  You’ll want to add a lot of different error handling for this to be very reliable, though.

CPAN PR Challenge – Text::BibTex

This month I ended up with Text::BibTex.  Conveniently, this distro came with an actual bug that needed fixing.

This, however, was not a Perl problem at all but rather a C problem.  The module was causing an actual crash with this stack trace:

/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7f9f39cc5f37] /lib/x86_64-linux-gnu/libc.so.6(+0xebdf0)[0x7f9f39cc4df0] /lib/x86_64-linux-gnu/libc.so.6(+0xead37)[0x7f9f39cc3d37] /usr/lib/libbtparse.so.1(zzFAIL+0xe4)[0x7f9f38eabdc4] /usr/lib/libbtparse.so.1(body+0xdf)[0x7f9f38eab56f] /usr/lib/libbtparse.so.1(entry+0x1ea)[0x7f9f38eab98a] /usr/lib/libbtparse.so.1(bt_parse_entry+0x100)[0x7f9f38ea9d40] /usr/lib/perl5/auto/Text/BibTeX/BibTeX.so(XS_Text__BibTeX__Entry__parse+0x 135)[0x7f9f390c1a45] /usr/lib/libperl.so.5.14(Perl_pp_entersub+0x58c)[0x7f9f3a6ba3cc] /usr/lib/libperl.so.5.14(Perl_runops_standard+0x16)[0x7f9f3a6b19a6] /usr/lib/libperl.so.5.14(perl_run+0x3a5)[0x7f9f3a6535b5] /usr/bin/perl(main+0x149)[0x400f89] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f9f39bf7ead] /usr/bin/perl[0x400fc1]

I don’t actually know anything about BibTex, so the first thing I did was strip down the problem data to the smallest thing that would still break and make a test case.  As it turned out, the smallest thing was about 2000 bytes of any text inside a comment that wasn’t closed.  This was a bit of a clue:

./btparse/pccts/antlr.h:74:#define ZZLEXBUFSIZE 2000

The culprit in this case was an improper use of strncat() in this block:

#ifdef LL_K
 static char text[LL_K*ZZLEXBUFSIZE+1];
 SetWordType *f[LL_K];
#else
 static char text[ZZLEXBUFSIZE+1];
 SetWordType *f[1];
#endif
 SetWordType **miss_set;
 char **miss_text;
 int *bad_tok;
 char **bad_text;
 int *err_k;
 int i;
 va_list ap;
#ifndef __USE_PROTOS
 int k;
#endif
#ifdef __USE_PROTOS
 va_start(ap, k);
#else
 va_start(ap);
 k = va_arg(ap, int); /* how many lookahead sets? */
#endif
 text[0] = '\0';
 for (i=1; i<=k; i++) /* collect all lookahead sets */
 {
 f[i-1] = va_arg(ap, SetWordType *);
 }
 for (i=1; i<=k; i++) /* look for offending token */
 {
#ifdef LL_K
 int freeSpace = (LL_K*ZZLEXBUFSIZE+1) - strlen(text);
#else
 int freeSpace = (ZZLEXBUFSIZE+1) - strlen(text);
#endif
 if ( i>1 ) strcat(text, " ");
 strncat(text, LATEXT(i), freeSpace);
 if ( !zzset_el((unsigned)LA(i), f[i-1]) ) break;
 }

strncat(a,b,n) looks like it safely avoids overflowing the destination buffer but in fact according to its specification:

 If src contains n or more bytes, strncat() writes n+1 bytes to dest (n from src plus the terminating null byte). Therefore, the size of dest must be at least strlen(dest)+n+1.

So to fix, we just allocate more space:

#ifdef LL_K
 static char text[LL_K*ZZLEXBUFSIZE+1+1];
 SetWordType *f[LL_K];
#else
 static char text[ZZLEXBUFSIZE+1+1];
 SetWordType *f[1];
#endif