Pages

Friday, April 19, 2024

EMA Eudravigilance - Dashboard and collector tool


In a recent interview with Jim Ferguson, we discussed a Tableau dashboard I had generated from pharmacovigilance data published by the European Medicines Agency (https://www.adrreports.eu/).



For those who wish to download the data, below is a Python script I developed with Twan van der Poel (https://vanderpoelsoftware.com/).

The tool downloads everything: both serious and non-serious line listings, and both products and substances.

Later I will post the shell scripts and SQL statements to process the CSV files into PostgreSQL.
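
Until then, here is a minimal sketch of what the PostgreSQL side could look like. This is an assumption rather than the final scripts: the column layout is borrowed from the ema.sta_icsr staging table used in the April 2021 post further down this page, and the file name and delimiter in the commented \copy line are only placeholders.

-- Minimal sketch (not the final scripts): a staging table for the downloaded line listings
CREATE SCHEMA ema;
CREATE TABLE ema.sta_icsr
(
  source varchar,            -- filename the row was loaded from
  elocnr character varying,  -- unique ICSR identifier
  reporttype character varying,
  evgatewayrecdate character varying,
  primarysourcequalification character varying,
  primarysourcecountryforregulatorypurposes character varying,
  literaturereference character varying,
  patientagegroup character varying,
  patientagegroupperreporter character varying,
  parentchildreport character varying,
  patientsex character varying,
  reactionlist character varying,
  suspectdruglist character varying,
  concomitant character varying,
  icsrform character varying
);
-- In psql (single line), load one downloaded file; file name and delimiter are placeholders:
-- \copy ema.sta_icsr (elocnr, reporttype, evgatewayrecdate, primarysourcequalification, primarysourcecountryforregulatorypurposes, literaturereference, patientagegroup, patientagegroupperreporter, parentchildreport, patientsex, reactionlist, suspectdruglist, concomitant, icsrform) from 'dl/substances_12345.csv' with csv header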

If you have questions, please feel free to leave a comment.

Download python script here: https://drive.google.com/file/d/1Zs8HiB3W-bd77OspdTrnhb8WiXsskfTM/view?usp=sharing

The Tableau dashboard is work in progress but can be downloaded here: https://drive.google.com/file/d/16RRvQJZ6VEyAF_Gm4DDPtE200esXTlSM/view?usp=share_link

(download a 14-day trial of Tableau to open the dashboard)

import errno, json, hashlib, logging, pprint, urllib.request, re, requests, os, ssl, sys, time, threading, queue

from alive_progress import alive_bar
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from pathlib import Path
from multiprocessing.pool import ThreadPool
from urllib.parse import urlparse, unquote, urlencode
from requests.utils import requote_uri
from termcolor import colored
"""
# EMA Crawler
Crawls data from adrreports.eu and converts it into
CSV and TXT data files. The script operates from the working
directory and will create "temp/", "output/" and "downloads/" subdirectories.
The "temp/" subdirectory contains remote cache files, the "output/"
directory contains the output files, and any downloads are stored in
the "downloads/" directory, which holds hash references to keep track of their origins.
* Authors: In alphabetical order:
*          Twan van der Poel <twan@vanderpoelsoftware.com>
*          Wouter Aukema <waukema@gmail.com>
* Usage:   python3 ema-crawler.py
* Version: 1.98
# Changelog
## 1.98
* Used new session for each executor.map iteration
* Further integration of multithreading
* Temporary removal of hash-subdirectory
* Added stopwatch
## 1.97
* Dynamic non-configurable download directory
* Recycled session for each .get call
* Removed last bits of BIN_CHROMEDRIVER
* Dedicated function for remote calls
* Further integration of remote HTTP errors
* Resolved several encoding issues
## 1.96
* Simplified target directory based on files' md5 hash
* Wrapped executor in progress-bar
* Added parameter VERBOSE_DEBUGGING
## 1.95
* Added setting for verbose chromedriver logging
* Removed Selenium approach
* Integrated requests.Session()
## 1.94
* Integrated dynamic download storage
* Feature which moves downloads to their final location
* Added setting USE_DOWNLOADS_CACHE
## 1.93
* Integrated generic fatal_shutdown handler
## 1.92
* Headless support for Chromedriver
* Integrated storage filenames in list of downloads
* Integrated filters to capture downloads of more than 250K of lines properly
* Added parameter "MAX_PRODUCT_COUNT" for testing small chunks
* Dynamic construction of URL (incl. filters)
* Cache integration in download directory
* Dynamic construction of filter combinations for Covid related HLC's
## Todo
* Replace hard-coded ID's with detection of 250k limitation
"""
# Settings
MAX_PRODUCT_COUNT    = None
MAX_THREADS          = 20
USE_DOWNLOADS_CACHE  = False
VERBOSE_DEBUGGING    = False
connect_timeout: int = 40
download_timeout: int = 100

# Variables used to split up large downloads
variables = {
    'Seriousness':                                   'Serious',
    'Regulatory Primary Source Country EEA/Non EEA': 'European Economic Area',
    'Primary Source Qualification':                  'Healthcare Professional',
    'Sex':                                           'Female',
}
# Years used to split up large downloads
years = {
    '2017': '2017',
    '2018': '2018',
    '2019': '2019',
    '2020': '2020',
    '2021': '2021',
    '2022': '2022',
    '2023': '2023',
    '2024': '2024',
}
# Timing
executionStart = time.time()
print("Version 1.98")
# Function which handles fatal shutdowns
def fatal_shutdown(errorMessage, identifier):
    print(colored('+--------------------------------------------------+', 'red'))
    print(colored('| EMA crawler crashed, please read the error below |', 'red'))
    print(colored('+--------------------------------------------------+', 'red'))
    print(colored(errorMessage, 'yellow')+" "+colored("("+identifier+")", "white"))
    sys.exit()
# Function which debugs verbose logging (if enabled)
def verbose_debug(logMessage):
    if not VERBOSE_DEBUGGING:
        return
    print(logMessage)
# Helper to exit execution
def exit():
    sys.exit()
# Determine downloads directory
downloadsPath = ("%s%sdl" % (os.getcwd(), os.sep))
# Default EMA DPA base URL
emaBaseUrl = 'https://dap.ema.europa.eu/analytics/saw.dll?Go'
# Prepare thread storage
threadStorage = threading.local()
# Pre-perform validation
if MAX_PRODUCT_COUNT is not None and not isinstance(MAX_PRODUCT_COUNT, int):
    fatal_shutdown("MAX_PRODUCT_COUNT must be one of; numeric, None", "V01")
if not isinstance(MAX_THREADS, int) or MAX_THREADS < 1:
    fatal_shutdown("MAX_THREADS must be numeric, at least 1", "V02")
if not os.path.exists(downloadsPath):
    os.mkdir(downloadsPath)
if not os.path.exists(downloadsPath):
    fatal_shutdown("Downloads directory does not exist (and could not be created)", "V03")
# Ensure working directories
basepath = os.path.dirname(os.path.abspath(sys.argv[0]))+os.sep
workdirs = ['temp'+os.sep, 'output'+os.sep, 'downloads'+os.sep]
for workdir in workdirs:
    fullpath = basepath+workdir
    if not os.path.exists(fullpath):
        os.mkdir(fullpath)
    if not os.access(fullpath, os.W_OK):
        fatal_shutdown(("Working directory %s is not writable" % fullpath), "P01")
# Define important directories
CACHEDIR = basepath+workdirs[0]
OUTPUTDIR = basepath+workdirs[1]
# Function which creates a directory
def create_directory(directory):
    try:
        os.mkdir(directory)
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
        pass
# Function which initiates the thread executor
def initiate_thread_executor(resources):
    # Determine index
    threadStorage.threadIndex = resources.get(False)
    # Open initial session and authenticate
    threadStorage.localSession = requests.Session()
    # Prepare params
    params = {
        'Path':      '/shared/PHV DAP/DAP/Run Line Listing Report',
        'Format':    'txt',
        'Extension': '.csv',
        'Action':    'Extract',
        'P0':         0
    }
    # Initial request
    ema_get_remote(threadStorage.localSession, params)
# Function which fetches a URL and caches it by a hash
def fetch_remote_url(url):
    hash = hashlib.md5(url.encode()).hexdigest()
    cacheFile = CACHEDIR+hash+".html"
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    if os.path.exists(cacheFile):
        verbose_debug("Fetching local from %s" % url)
        contents = Path(cacheFile).read_text()
        return contents.encode('utf-8')
    else:
        verbose_debug("Fetching remote from %s" % url)
        try:
            with urllib.request.urlopen(url, context=ctx) as response:
                content = response.read()
            if not content:
                fatal_shutdown("Unable to fetch remote data from '%s'" % url, "R01")
        except:
            verbose_debug("Remote data is empty '%s'" % url)
            content = "".encode('utf-8')
        f = open(cacheFile, "wb")
        f.write(content)
        f.close()
        return fetch_remote_url(url)
# Function which creates BI filtered URL's
def create_filtered_url(combination):
    filters = combination['filters']
    params = {}
    params['Path']      = '/shared/PHV DAP/DAP/Run Line Listing Report'
    params['Format']    = 'txt'
    params['Extension'] = '.csv'
    params['Action']    = 'Extract'
    paramCounter = 0
    if len(filters) > 0:
        keyName = ('P%d' % paramCounter)
        params[keyName] = len(filters)
        paramCounter = paramCounter + 1
        for filter in filters:
            keyName = ('P%d' % paramCounter)
            params[keyName] = filter[0]
            paramCounter = paramCounter + 1
            keyName = ('P%d' % paramCounter)
            familyName = ('"Line Listing Objects"."%s"' % filter[1])
            params[keyName] = familyName
            paramCounter = paramCounter + 1
            keyName = ('P%d' % paramCounter)
            params[keyName] = (("1 %s" % filter[2]))
            paramCounter = paramCounter + 1
    baseUrl = emaBaseUrl
    url = baseUrl
    for property in params:
        url += ('&%s=%s' % (property, params[property]))
    return {
        'baseUrl': baseUrl,
        'params': params,
        'url': url
    }
# Function which calls remote with the right session and headers
def ema_get_remote(localSession, localParams):
    # Prepare statics
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }
    errorPattern = r'<div class="ErrorMessage">(.*?)</div>'
    # Perform request
    response = localSession.get(emaBaseUrl, params=localParams, headers=headers, timeout=(connect_timeout, download_timeout))
    # Non 200 code?
    if response.status_code != 200:
        print(localParams)
        fatal_shutdown(f"Non remote 200 HTTP code, got {response.status_code}", "EGR01")
    # Explicitly specify the encoding if needed
    responseText = response.content.decode('utf-16', errors='replace')
    # Extract remote error?
    match = re.search(errorPattern, responseText, re.DOTALL)
    if match:
        print(localParams)
        fatal_shutdown(("Remote URL got error-response: %s" % match.group(1)), "EGR02")
    # Here you go, sir.
    return responseText
# Function which processes compiled (downloadable) items
def process_compiled_item(item):
    # Prepare target
    hash = hashlib.md5(item['DownloadUrl'].encode()).hexdigest()
    #targetDirectory = "%s%s%s" % (downloadsPath, os.sep, hash)
    targetDirectory = "%s%s" % (downloadsPath, os.sep)
    targetFilename = "%s%s%s" % (targetDirectory, os.sep, item['OutputFile'])
    # Apply cache?
    if USE_DOWNLOADS_CACHE and os.path.exists(targetFilename):
        if not VERBOSE_DEBUGGING:
            bar()
        return
    # Create target directory
    create_directory(targetDirectory)
    # Debug
    verbose_debug("\nProcessing download")
    verbose_debug("-------------------")
    verbose_debug("-> HLC:  "+item['HighLevelCode'])
    verbose_debug("-> URL:  "+item['DownloadUrl'])
    verbose_debug("-> Into: "+targetDirectory)
    verbose_debug("-> File: "+targetFilename)
    # Perform the request
    responseText = ema_get_remote(threadStorage.localSession, item['Params'])
    # Write response body to file
    with open(targetFilename, 'w', encoding='utf-8') as file:
        file.write(responseText)
    verbose_debug("-> READY")
    if not VERBOSE_DEBUGGING:
        bar()
# Function which generates a matrix
def generate_matrix(variables):
    options = [[]]
    for name, value in variables.items():
        newOptions = []
        for option in options:
            newOptions.append(option + [f"{name}={value}"])
            newOptions.append(option + [f"{name}!={value}"])
        options = newOptions
    return options
# Fetch product- and substances list
collection = []
productCounter = 0
breakIteration = False
baseUrls = {}
baseUrls['products']   = 'https://www.adrreports.eu/tables/product/'
baseUrls['substances'] = 'https://www.adrreports.eu/tables/substance/'
chars = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0-9']
for type in baseUrls:
    baseUrl = baseUrls[type]
    baseType = type[:-1].capitalize()
    for char in chars:
        indexUrl = "%s%s.html" % (baseUrl, char)
        content = fetch_remote_url(indexUrl)
        pattern = r"<a\s+.*?</a>"
        matches = re.findall(pattern, str(content))
        for match in matches:
            # Extract CA
            centrallyAuthorized = "1"
            if type == 'substances':
                centrallyAuthorized = "0"
                if re.search(r"\*", match):
                    centrallyAuthorized = "1"
            # Extract URL
            words = match.strip().split('"')
            url = words[1]
            # Extract code
            parsed = urlparse(unquote(url))
            words = parsed.query.split('+')
            code = words[-1]
            # Extract name
            parsed = match.split(">")
            parsed = parsed[1].split("<")
            name = parsed[0]
            # Construct filters:
            #   First parameter is the total count of parameters
            #   Then, groups of 3 parameters are added
            #   Parameters above are respectively: operator, var1, var2 or value
            combinations = []
            # Default definition
            defaultFilters = []
            defaultFilters.append(['eq', baseType+' High Level Code', code])
            defaultFilters.append(['eq', 'Seriousness', 'Serious'])
            # Prepare outputfile
            filterString = ""
            # For Covid-related HLCs we chop up the result because of the 250K line-listing limit
            if type == 'substances' and code in ['40983312', '40995439', '42325700', '21755']:
                # Generate matrix
                matrix = generate_matrix(variables)
                # Construct combo
                comboCounter = 0
                for year in years:
                    for square in matrix:
                        filterString = ""
                        filters = defaultFilters.copy()
                        filters.append(['eq', 'Gateway Year', year])
                        for point in square:
                            if re.search('!=', point):
                                variable, value = point.split('!=')
                                operator = 'neq'
                                filterString += "_Not"
                            else:
                                variable, value = point.split('=')
                                operator = 'eq'
                            filters.append([operator, variable, value])
                            filterString += ("_%s" % value)
                        comboCounter = comboCounter + 1
                        outputFile = ("%s_%s_%s%s.csv" % (type, code, year, filterString))
                        targetfile = downloadsPath + '/' + outputFile
                        if not os.path.exists(targetfile):
                            #print(f"{outputFile} exists")
                            #print(f"Working on {outputFile}")
                            combinations.append({'filters': filters, 'outputFile': outputFile, 'code': code, 'type': type})
            else:
                # Default ID match filter
                filters = defaultFilters.copy()
                outputFile = ("%s_%s.csv" % (type, code))
                targetfile = downloadsPath + '/' + outputFile
                if not os.path.exists(targetfile):
                    #print(f"Working on {outputFile}")
                    combinations.append({'filters': filters, 'outputFile': outputFile, 'code': code, 'type': type})
            # Raise product-counter
            productCounter += 1
            # Run every combination
            for combination in combinations:
                # Create URL
                result = create_filtered_url(combination)
                # Replace placeholders
                downloadUrl = result['url']
                downloadUrl = downloadUrl.replace('$hlc', code)
                downloadUrl = downloadUrl.replace('$type', type)
                # Collection storage
                item = {}
                item['HighLevelCode'] = code
                item['ProductName'] = name
                item['CentrallyAuthorized'] = centrallyAuthorized
                item['Type'] = type
                item['BaseUrl'] = result['baseUrl']
                item['IndexUrl'] = indexUrl
                item['DownloadUrl'] = downloadUrl
                item['OutputFile'] = combination['outputFile']
                item['Filters'] = combination['filters']
                item['Params'] = result['params']
                collection.append(item)
                #print(f"HLC: {item['HighLevelCode']}")
                # this is where the list of HLCs is assembled
            # Max product counter
            if MAX_PRODUCT_COUNT and productCounter == MAX_PRODUCT_COUNT:
                verbose_debug(("--> Warning: MAX_PRODUCT_COUNT reached = %d" % MAX_PRODUCT_COUNT))
                breakIteration = True
                break
        if breakIteration:
            break
    if breakIteration:
        break
# Preformat output
output = {}
for item in collection:
    type = item['Type']
    if not type in output.keys():
        output[type] = []
    output[type].append([str(item['HighLevelCode']), str(item['ProductName']), str(item['CentrallyAuthorized'])])
# Generate indexed output files
for type in output:
    resultFile = "%s%s.txt" % (OUTPUTDIR, type)
    verbose_debug("Generating output file %s" % resultFile)
    if os.path.exists(resultFile):
        os.remove(resultFile)
    lines = output[type]
    content = ""
    for line in lines:
        content += "\t".join(line)+"\n"
    f = open(resultFile, "wb")
    f.write(content.encode('utf-8'))
    f.close()
# Debug
resultFile = "%surls.txt" % (OUTPUTDIR)
if os.path.exists(resultFile):
    os.remove(resultFile)
verbose_debug("Generating URL output file %s" % resultFile)
content = ""
for item in collection:
    content += "\n"+item['OutputFile']+"\n"+item['DownloadUrl']+"\n"
f = open(resultFile, "wb")
f.write(content.encode('utf-8'))
f.close()
# Run executor
if VERBOSE_DEBUGGING:
    resources = queue.Queue(MAX_THREADS)
    for threadIndex in range(MAX_THREADS):
        resources.put(threadIndex, False)
    with ThreadPool(MAX_THREADS, initiate_thread_executor, [resources]) as pool:
        pool.map(process_compiled_item, collection)
else:
    collectionLength = len(collection)
    with alive_bar(collectionLength) as bar:
        resources = queue.Queue(MAX_THREADS)
        for threadIndex in range(MAX_THREADS):
            resources.put(threadIndex, False)
        with ThreadPool(MAX_THREADS, initiate_thread_executor, [resources]) as pool:
            pool.map(process_compiled_item, collection)
# Fin
if MAX_PRODUCT_COUNT is not None:
    print(f"Requested {MAX_PRODUCT_COUNT} files")
runtime = round(time.time() - executionStart)
print(f"Execution finished in ~ {runtime} seconds")

Friday, June 25, 2021

Loading VAERS data into Postgres

For those who want to load VAERS data into Postgres, here's how I did it.

In case you have technical questions, reach out on Twitter: @waukema.

/*

DROP TABLE vaers.symptoms;

CREATE TABLE vaers.symptoms
(
  id serial NOT NULL,
  vaers_id char(7),
  symptom1 character varying,
  symptomversion1 numeric,
  symptom2 character varying,
  symptomversion2 numeric,
  symptom3 character varying,
  symptomversion3 numeric,
  symptom4 character varying,
  symptomversion4 numeric,
  symptom5 character varying,
  symptomversion5 numeric
)
WITH (
  OIDS=FALSE
);

--msdos windows server 2008 postgres 9.3 using gnuwin32 utils:
--cat 2021VAERSSYMPTOMS.csv|psql -h localhost -d postgres -a -c "set client_encoding=latin1; copy vaers.symptoms (VAERS_ID, SYMPTOM1, SYMPTOMVERSION1, SYMPTOM2, SYMPTOMVERSION2, SYMPTOM3, SYMPTOMVERSION3, SYMPTOM4, SYMPTOMVERSION4, SYMPTOM5, SYMPTOMVERSION5) from stdin delimiter ',' CSV HEADER"


create table vaers.allsymptoms as
(
  select vaers_id, symptom1 as symptom, symptomversion1 as symptomversion
  from vaers.symptoms where symptom1 is not null
  union all
  select vaers_id, symptom2 as symptom, symptomversion2 as symptomversion
  from vaers.symptoms where symptom2 is not null
  union all
  select vaers_id, symptom3 as symptom, symptomversion3 as symptomversion
  from vaers.symptoms where symptom3 is not null
  union all
  select vaers_id, symptom4 as symptom, symptomversion4 as symptomversion
  from vaers.symptoms where symptom4 is not null
  union all
  select vaers_id, symptom5 as symptom, symptomversion5 as symptomversion
  from vaers.symptoms where symptom5 is not null
);

create index ix_allsymptoms_vaers_id on vaers.allsymptoms using btree (vaers_id);
create index ix_allsymptoms_symptom on vaers.allsymptoms using btree (symptom);


drop table vaers.vax;

create table vaers.vax
(
  id serial,
  VAERS_ID varchar,
  VAX_TYPE varchar,
  VAX_MANU varchar,
  VAX_LOT varchar,
  VAX_DOSE_SERIES varchar,
  VAX_ROUTE varchar,
  VAX_SITE varchar,
  VAX_NAME varchar
)
WITH (
  OIDS=FALSE
);

--msdos postgres 9.3
--cat 2021VAERSVAX.csv|psql -h localhost -d postgres -a -c "copy vaers.vax (VAERS_ID, VAX_TYPE, VAX_MANU, VAX_LOT, VAX_DOSE_SERIES, VAX_ROUTE, VAX_SITE, VAX_NAME) from stdin delimiter ',' CSV HEADER"

create index ix_vax_vaers_id on vaers.vax using btree (vaers_id);


drop table if exists vaers.data;

create table vaers.data
(
  id serial,
  VAERS_ID varchar,
  RECVDATE varchar,
  STATE varchar,
  AGE_YRS varchar,
  CAGE_YR varchar,
  CAGE_MO varchar,
  SEX varchar,
  RPT_DATE varchar,
  SYMPTOM_TEXT varchar,
  DIED varchar,
  DATEDIED varchar,
  L_THREAT varchar,
  ER_VISIT varchar,
  HOSPITAL varchar,
  HOSPDAYS varchar,
  X_STAY varchar,
  DISABLE varchar,
  RECOVD varchar,
  VAX_DATE varchar,
  ONSET_DATE varchar,
  NUMDAYS varchar,
  LAB_DATA varchar,
  V_ADMINBY varchar,
  V_FUNDBY varchar,
  OTHER_MEDS varchar,
  CUR_ILL varchar,
  HISTORY varchar,
  PRIOR_VAX varchar,
  SPLTTYPE varchar,
  FORM_VERS varchar,
  TODAYS_DATE varchar,
  BIRTH_DEFECT varchar,
  OFC_VISIT varchar,
  ER_ED_VISIT varchar,
  ALLERGIES varchar
)
WITH (
  OIDS=FALSE
);

--msdos postgres 9.3
--cat 2021VAERSDATA.csv|psql -h localhost -d postgres -a -c "set client_encoding=latin1; copy vaers.data (VAERS_ID,RECVDATE,STATE,AGE_YRS,CAGE_YR,CAGE_MO,SEX,RPT_DATE,SYMPTOM_TEXT,DIED,DATEDIED,L_THREAT,ER_VISIT,HOSPITAL,HOSPDAYS,X_STAY,DISABLE,RECOVD,VAX_DATE,ONSET_DATE,NUMDAYS,LAB_DATA,V_ADMINBY,V_FUNDBY,OTHER_MEDS,CUR_ILL,HISTORY,PRIOR_VAX,SPLTTYPE,FORM_VERS,TODAYS_DATE,BIRTH_DEFECT,OFC_VISIT,ER_ED_VISIT,ALLERGIES) from stdin delimiter ',' CSV HEADER"

create index ix_data_vaers_id on vaers.data using btree (vaers_id);

*/



select
  vv.VAX_MANU as manufacturer,
  vv.VAX_NAME as vax_name,
  VAX_TYPE,
  RECVDATE,
  count(distinct vd.vaers_id) as num_records
from
  vaers.data vd
  left join vaers.allsymptoms vs on vs.vaers_id = vd.vaers_id
  left join vaers.vax vv on vv.vaers_id = vd.vaers_id
where
  vs.symptom like 'Myocard%'
  --and not vd.died = 'Y'
group by
  vv.VAX_MANU,
  vv.VAX_NAME,
  VAX_TYPE,
  RECVDATE
;


-- example:


WITH D as
(
  select
    vd.recvdate,
    vd.state,
    vd.age_yrs,
    vd.sex,
    vd.recovd,
    vd.died,
    vd.l_threat,
    vd.hospital,
    vd.disable,
    vd.ofc_visit,
    vs.symptom,
    vv.vax_manu,
    vv.vax_type,
    vd.vaers_id
  from
    vaers.data vd
    left join vaers.allsymptoms vs on vs.vaers_id = vd.vaers_id
    left join vaers.vax vv on vv.vaers_id = vd.vaers_id
  where
    vd.symptom_text ilike '%myocardi%'
)

select
  *
from D
limit 10

Tuesday, April 20, 2021

Analysis of ICSR reports at ema.europa.eu

Technical documentation to the article titled:

"The Costs of Covid-19 Vaccinations – Should we Rethink the Policy?"

This post provides documentation for the analysis performed for the above-mentioned article (this link), as well as some additional visualisations produced from the data downloaded from EMA and ECDC.

Vaccination Volume Data from ECDC

The data for COVID-19 vaccination volumes can be found here:

https://www.ecdc.europa.eu/en/publications-data/data-covid-19-vaccination-eu-eea   


Individual Case Safety Reports from EMA

1. To download all ICSR reports, follow the steps below.

Open http://www.adrreports.eu/en/search_subst.html, navigate to 'C' and search for COVID-19.

To download ICSR reports for ALL forms of medication, one would need to visit the 4,000+ entries listed. As a workaround, I have a shortcut described below.

Since EMA does not allow the public to download all ICSR reports for all forms of medication in one go, I came up with a workaround by manipulating the URL that enters their Oracle back end. By simply not specifying the identifier for a particular substance, all ICSR reports for all forms of medication get shown.

Shortcut to all Substances (incl. COVID19-vaccines)

Generate and download the ICSR line listings for the COVID, measles and influenza vaccines, as well as for the shortcut covering all forms of medication, and repeat steps 2, 3 and 4 explained below for each of them:

2. Export the ICSR reports to TAB-delimited text files as follows:

Navigate to the tab labelled 'Line Listing'.

For Geographic Region, select 'European Economic Area'.

Next, click on 'Run Line Listing Report'. This will open a new window and present you with the ICSR reports in tabular format. This may take a while.

Navigate all the way to the bottom of the listing and start the export from there.

The export will be saved to the default Downloads folder of your web-browser.

3. Convert and rename all exported files to UTF-8 encoding (e.g. using Notepad++ for Windows).

In the instructions below, I used a naming convention as follows: MMMDD_export_[substance].csv

4. Load the exported file into PostgreSQL

Create a schema and table in PostgreSQL (I use an old version 9.1 on an old Windows Server 2008)

CREATE SCHEMA ema;
CREATE TABLE ema.sta_icsr
(
  source varchar,
  elocnr character varying,
  reporttype character varying,
  evgatewayrecdate character varying,
  primarysourcequalification character varying,
  primarysourcecountryforregulatorypurposes character varying,
  literaturereference character varying,
  patientagegroup character varying,
  patientagegroupperreporter character varying,
  parentchildreport character varying,
  patientsex character varying,
  reactionlist character varying,
  suspectdruglist character varying,
  concomitant character varying,
  icsrform character varying
)
WITH (
  OIDS=FALSE);

Next, load all ICSR exports into the table. I used the following DOS command:

for /F %f in ('dir/b APR13_*.csv') do tail -n +2 %f|sed -e s#^^#%f\t#|sed -e s#\xED\xAF\x80#t#g -e s#\xED\xB6\xA9#t#g -e s#\xed\xb6\x9f#i#g -e s#\\#\\\\#g|psql -h localhost -U postgres -a -c "copy ema.sta_icsr from stdin"

Description of the DOS command: for each export file (%f), skip the header, prepend a column identifying the filename, replace some unreadable characters, escape backslashes with an extra backslash, and tell psql to execute the copy command.
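
If you are not on Windows, a rough pure-SQL alternative (just a sketch, assuming the exports are plain tab-delimited text as above and that you strip the header line yourself; the file name is only an example) is to load each file into a temporary table without the source column and then insert it with the filename as a literal:

-- Sketch of a pure-SQL alternative to the sed pipeline above
create temporary table tmp_icsr (like ema.sta_icsr);
alter table tmp_icsr drop column source;
-- In psql (single line), after removing the header line from the file:
-- \copy tmp_icsr from 'APR13_export_pfizer.csv'
insert into ema.sta_icsr
select 'APR13_export_pfizer.csv', t.*
from tmp_icsr t;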

5. De-duplication of records

To de-duplicate, we need a table that holds only unique entries. We also need to give priority / precedence to specific substances before we load the rest (all substances, through that manipulated URL).

create table ema.dwh_geneloc (
id serial,
source varchar,
elocnr varchar
)
without oids;

insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%pfizer%'
;
insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%astrazeneca%'
;
insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%moderna%'
;
insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%janssen%'
;
insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%MMR%'
;
insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%Influenza-split-virion-inactivated%'
;
insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%influenza-live-nasal%'
;
insert into ema.dwh_geneloc (source, elocnr)
select
source, elocnr 
from 
ema.sta_icsr
where 
not (elocnr) in (select elocnr from ema.dwh_geneloc group by elocnr)
and source like '%all_all%'
;
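
A quick sanity check (just a sketch) after these inserts: the total row count and the number of distinct elocnr values in ema.dwh_geneloc should be equal if the de-duplication worked.

-- Sanity check: both counts should match if the de-duplication worked
select
count(*) as num_rows,
count(distinct elocnr) as num_unique_cases
from
ema.dwh_geneloc;
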
6. Create some indexes to speed up query performance

create index ix_staicsr_source on ema.sta_icsr using btree (source);
create index ix_staicsr_elocnr on ema.sta_icsr using btree (elocnr);
create index ix_geneloc_source on ema.dwh_geneloc using btree (source);
create index ix_geneloc_elocnr on ema.dwh_geneloc using btree (elocnr);

7. Run query and export results to a text file
select
i.source,
split_part(split_part(i.source, 'apr13_', 2), '_', 1) as product,
i.elocnr,
date(to_timestamp(evgatewayrecdate, 'yyyy-mm-dd')) as recvd_date,
patientagegroup,
patientsex,
case 
when primarysourcequalification like 'Non Health%' then false
else true
end as hcp_reported,
replace(split_part(unnest(string_to_array(reactionlist, ',<BR><BR>')), E'\x20\(', 1), '"', '') as Adr_Text,
split_part(replace(split_part(unnest(string_to_array(reactionlist, ',<BR><BR>')), E'\x20\(', 2), '"', ''), ' - ', 2) as Adr_Outcome
from
ema.dwh_geneloc e
join ema.sta_icsr i on e.source = i.source and e.elocnr = i.elocnr

Save the query output to disk, so that you can pick it up with a BI tool such as Tableau.
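
One way to do that (a sketch; the view name is my own choice, and psql's \copy meta-command has to stay on a single line) is to wrap the query from step 7 in a view and export the view with \copy:

-- Sketch: wrap the step-7 query in a view, then export it from psql
create view ema.v_icsr_flat as
select
i.source,
split_part(split_part(i.source, 'apr13_', 2), '_', 1) as product,
i.elocnr,
date(to_timestamp(evgatewayrecdate, 'yyyy-mm-dd')) as recvd_date,
patientagegroup,
patientsex,
case
when primarysourcequalification like 'Non Health%' then false
else true
end as hcp_reported,
replace(split_part(unnest(string_to_array(reactionlist, ',<BR><BR>')), E'\x20\(', 1), '"', '') as Adr_Text,
split_part(replace(split_part(unnest(string_to_array(reactionlist, ',<BR><BR>')), E'\x20\(', 2), '"', ''), ' - ', 2) as Adr_Outcome
from
ema.dwh_geneloc e
join ema.sta_icsr i on e.source = i.source and e.elocnr = i.elocnr;

-- In psql (single line):
-- \copy (select * from ema.v_icsr_flat) to 'apr13_icsr_flat.csv' with csv header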

8. To select only Fatal reactions, this query will do the job:
WITH all_icsr AS
(
select
i.source,
split_part(split_part(i.source, 'apr13_', 2), '_', 1) as product,
i.elocnr,
date(to_timestamp(evgatewayrecdate, 'yyyy-mm-dd')) as recvd_date,
patientagegroup,
patientsex,
case 
when primarysourcequalification like 'Non Health%' then false
else true
end as hcp_reported,
replace(split_part(unnest(string_to_array(reactionlist, ',<BR><BR>')), E'\x20\(', 1), '"', '') as Adr_Text,
split_part(replace(split_part(unnest(string_to_array(reactionlist, ',<BR><BR>')), E'\x20\(', 2), '"', ''), ' - ', 2) as Adr_Outcome
from
ema.dwh_geneloc e
join ema.sta_icsr i on e.source = i.source and e.elocnr = i.elocnr
)

select
    *
from
    all_icsr
where
    adr_outcome ilike '%fatal%'

9. Beware of difference between ICSR and ADR

Within EMA, an ICSR report relates to a patient / person.
An ICSR report can record multiple reactions and corresponding outcomes.

When counting the number of deaths per 100K vaccinations, you should therefore not count the number of reactions with a fatal outcome, as a single patient can have multiple fatal reactions recorded.

In the article, we have used the distinct count of unique ICSR reports. 
The query below illustrates the difference between the two measures:

with all_reports as 
(
select
i.elocnr,
split_part(split_part(i.source, 'apr13_', 2), '_', 1) as product,
split_part(replace(split_part(unnest(string_to_array(reactionlist, ',<BR><BR>')), E'\x20\(', 2), '"', ''), ' - ', 2) as adr_outcome
from
ema.dwh_geneloc e
join ema.sta_icsr i on e.source = i.source and e.elocnr = i.elocnr
)
select 
product,
count(*) as num_adr,
count(distinct elocnr) as num_icsr
from 
all_reports
where
not product = 'all'
and adr_outcome ilike '%fatal%'
group by
product



Tuesday, December 8, 2020

Meta-data Analysis at eurosurveillance.org

Background

The goal is to understand how much time it typically takes for Research papers to get reviewed and accepted by eurosurveillance.org.

The reason for this assessment is to provide clarity around discussions of a specific research paper that was reviewed and accepted in a single day. Some scientists think it is impossible to peer-review research within a single day. Other scientists claim the paper went through the much quicker Rapid Review procedure, as outlined on the journal's website.

To assess what is typical in the review and acceptance process at eurosurveillance.org, the author collected and analysed meta-data for all 1,595 publications since 01-Jan-2015. Earlier this week, the author shared the initial findings of this assessment in a Twitter post.

This six-page document aims to make these findings reproducible and verifiable by offering step-by-step instructions.

Summary of Findings

  • Of the 17 types of articles published since 2015, three types occur most frequently: Rapid Communication (385), Research (312) and Surveillance (193).
  • The average number of days between reception and acceptance of Research-type articles is 172 (2019) and 97 (2020).
  • In line with the Editorial Policy for Authors, 'Rapid Communication' publications appear to be reviewed and accepted more quickly (18 days on average) than 'Research' and 'Surveillance' articles.
  • Except for this one Research article (on 22-Jan-2020), no other article has been reviewed and accepted within a single day since 2015.

Picture that places the discussed Research Paper into context

For a fully interactive dashboard, position your browser in landscape and click here.

About the Author

Wouter Aukema has over 30 years of experience in processing data and analysing the behaviour of organisations and systems for governments and corporations worldwide, and develops analysis solutions for Fortune 100 companies. His publication at Defcon (20 years ago) made headlines worldwide. Wouter sees data and patterns where others don’t.

Download the Report

A PDF version of the report -with step by step instructions- can be downloaded here.

Thursday, March 12, 2020

Corona Pandemanic: Gentle Surgeons Make Stinky Wounds

Countries take different measures when it comes to the Corona Pandemanic, and I was wondering whether this could be visualized with the data published by Johns Hopkins University (JHU) on GitHub.

Here's my attempt:


What I produced shows how various countries arrived at their 'score' of infections. Vertical axis: percentage of total confirmed infections on March 11, 2020. Horizontal axis: days, the past 14 out of 49 total days since the news came out. I picked a few countries from Asia, Europe and the USA with a significant number of infections. The underlying dashboard is available on my Tableau Public repository.

To my limited understanding, we are aiming for the top of a bell curve. This would mean that China has already reached its top (see the graph below for its full date range) and that Singapore and South Korea are getting close to theirs.

If we look at the USA, the Netherlands, Germany etc., we are still climbing up the curve. Steeply. The USA is climbing even more rapidly than we are here in the Netherlands.

In Dutch, we have a saying: "gentle surgeons make stinky wounds".

Confirmed cases in China:

Thursday, February 20, 2020

NOx Emissions by Industry: Here They Are.

The report of the Adviescollege Stikstofproblematiek (the Dutch advisory committee on nitrogen issues) stated that the NEC sector Industry had emitted 50.7 kilotons of NOx in 2017.

Below are all of those companies, in an interactive dashboard (click on the picture).
It is also nice to see that the figures come close to those in the report.

The data comes from http://www.emissieregistratie.nl/erpubliek/erpub/facility.aspx. I selected all companies for which figures for 2017 were reported.

Monday, February 17, 2020

Convert Dutch RD coordinates to WGS-84

Some weeks ago, I found this great Tableau formula by Rik van Schaik to convert x,y coordinates from the Dutch Rijksdriehoek system to latitude and longitude.

Works great in PostgreSQL too:

 -- Based on a Tableau calculated field formula by Rik Van Schaik
 -- Added more decimals for precision
 -- Rewritten for PostgreSQL
 -- Replace x & y by field names of your choice
 -- feb-2020 aukema.org

select
 52.15517440 + ((
 (3235.65389 * ((y - 463000) * 10 ^ -5)) + 
 (-32.58297 * ((x - 155000) * 10 ^ -5) ^ 2) + 
 (-0.2475 * ((y - 463000) * 10 ^ -5) ^ 2) + 
 (-0.84978 * ((x - 155000) * 10 ^ -5) ^ 2 * ((y - 463000) * 10 ^ -5)) + 
 (-0.0655 * ((y - 463000) * 10 ^ -5) ^ 3) + 
 (-0.01709 * ((x - 155000) * 10 ^ -5) ^ 2 * ((y - 463000) * 10 ^ -5) ^ 2) + 
 (-0.00738 * ((x - 155000) * 10 ^ -5)) + 
 (0.0053 * ((x - 155000) * 10 ^ -5) ^ 4) + 
 (-0.00039 * ((x - 155000) * 10 ^ -5) ^ 2 * ((y - 463000) * 10 ^ -5) ^ 3) + 
 (0.00033 * ((x - 155000) * 10 ^ -5) ^ 4 * ((y - 463000) * 10 ^ -5)) + 
 (-0.00012 * ((x - 155000) * 10 ^ -5) * ((y - 463000) * 10 ^ -5))
 ) / 3600) as lat,

 5.38720621 + (( 
 (5260.52916 * ((x - 155000) * 10 ^ -5)) + 
 (105.94684 * ((x - 155000) * 10 ^ -5) * ((y - 463000) * 10 ^ -5)) + 
 (2.45656 * ((x - 155000) * 10 ^ -5) * ((y - 463000) * 10 ^ -5) ^ 2) + 
 (-0.81885 * ((x - 155000) * 10 ^ -5) ^ 3) + 
 (0.05594 * ((x - 155000) * 10 ^ -5) * ((y - 463000) * 10 ^ -5) ^ 3) + 
 (-0.05607 * ((x - 155000) * 10 ^ -5) ^ 3 * ((y - 463000) * 10 ^ -5)) + 
 (0.01199 * ((y - 463000) * 10 ^ -5)) + 
 (-0.00256 * ((x - 155000) * 10 ^ -5) ^ 3 * ((y - 463000) * 10 ^ -5) ^ 2) + 
 (0.00128 * ((x - 155000) * 10 ^ -5) * ((y - 463000) * 10 ^ -5) ^ 4) + 
 (0.00022 * ((y - 463000) * 10 ^ -5) ^ 2) + 
 (-0.00022 * ((x - 155000) * 10 ^ -5) ^ 2) + 
 (0.00026 * ((x - 155000) * 10 ^ -5) ^ 5)
 ) / 3600) as lon