ProAct Analytical Platform

Design & Development from Scratch

Characteristics

  • 6 – specialists in the MK project team
  • 46 – procurement datasets of individual countries
  • 2 – procurement datasets of World Bank- and IDB-financed contracts worldwide
  • 120 – countries covered by the datasets
  • 60 – data ranges
  • 21 – contracts
  • 5+ – suppliers
  • 1 – buyers across 120 countries

ProAct – Global Procurement Anticorruption & Transparency platform prototype development

Contractor

The World Bank

Collaborators

Government Transparency Institute

Delivery period

Prototype: January 2020 – June 2021
Modifications: May 2022 – July 2022

Goal/Business challenge

  • Provide easy access to globally available public procurement data from 48 datasets.
  • Enable users to identify, analyze and monitor public procurement performance and integrity risks to inform preventive actions, process improvements, and policy reform.
  • Help prevent corruption and promote transparency and integrity in public expenditures on goods, works, and services.

Results

Value delivered:

  • optimal solutions that fit the customer’s budget and timeframes;
  • reduced AWS maintenance costs and automated dataset uploads;
  • fast performance, customization, and scalability of the product.

Services

Services provided:

  • Architecture development
  • Infrastructure development on AWS
  • Front- and Back-end code development
  • DevOps optimization and support

The project consists of the following parts

  1. Web server VM, which runs the front-end and back-end servers;
  2. ELS – Elasticsearch with the processed data (indexes of buyers, suppliers, etc.);
  3. On-demand EMR cluster for dataset processing, with results stored in ELS;
  4. S3 buckets: the EMR cluster bucket, used to store the EMR processing code (Scala code for Spark), and the datasets bucket, used to store raw datasets;
  5. Lambda function, which fires the EMR cluster when a new dataset is uploaded to the S3 bucket (see the sketch below).
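
A minimal sketch of the trigger described in item 5: an AWS Lambda handler that reads the S3 upload event and starts an on-demand EMR cluster with a Spark step pointing at the processing code. The bucket names, jar path, instance sizes, and role names are illustrative placeholders, not the project’s actual configuration.

```python
# Hypothetical Lambda handler: fire an on-demand EMR cluster when a dataset lands in S3.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # The S3 PUT event carries the bucket and key of the uploaded dataset.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    response = emr.run_job_flow(
        Name=f"proact-import-{key.split('/')[-1]}",
        ReleaseLabel="emr-6.2.0",                      # placeholder release
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,      # terminate when the import step finishes
        },
        Steps=[
            {
                "Name": "process-dataset",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "s3://emr-cluster-bucket/processing.jar",  # placeholder path to the Scala/Spark build
                        f"s3://{bucket}/{key}",                    # raw dataset to process
                    ],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"jobFlowId": response["JobFlowId"]}
```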

Datasets Upload

The CI/CD pipeline is organized so that uploading a new dataset (or several datasets) to the root of the S3 folder triggers the processing script, and the dataset(s) start to be processed and calculated. During this process the following steps are performed, each confirmed in the Slack channel (see the sketch after this list):

  • Check the file structure to confirm that it is a dataset
  • Read, transform, and normalize the input data
  • Check if Elasticsearch is up
  • Notify if the cluster is not ready to scale
  • Check if Elasticsearch is ready to upscale
  • Start the import via the CI/CD pipeline
  • Start the Elasticsearch upscale
  • Notify when the dataset import finishes
  • Notify when Elasticsearch stops
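
As an illustration of two of the steps above, the helpers below post a per-action confirmation to the Slack channel and check whether Elasticsearch is up before the import starts. The webhook URL, cluster endpoint, and helper names are assumptions for the sketch, not the project’s actual code.

```python
# Hypothetical pipeline helpers: Slack confirmation and Elasticsearch health check.
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook
ES_ENDPOINT = "https://els.example.internal:9200"               # placeholder ELS endpoint

def notify_slack(message: str) -> None:
    """Send a one-line confirmation for a pipeline action to the Slack channel."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def elasticsearch_is_up() -> bool:
    """Return True if the cluster health endpoint reports green or yellow status."""
    try:
        with urllib.request.urlopen(f"{ES_ENDPOINT}/_cluster/health") as resp:
            health = json.loads(resp.read())
        return health.get("status") in ("green", "yellow")
    except OSError:
        return False

# Example usage inside the pipeline:
# if not elasticsearch_is_up():
#     notify_slack("Cluster is not ready to scale, import postponed")
```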

When several datasets are uploaded together, they are queued for processing.
If there is a queue, or a dataset is very large, several EMR clusters are built; in this case the user sees the corresponding notifications in Slack.
Uploading a new dataset for a country that already has an old dataset in the system does not require the operator to delete the old dataset; only the new one needs to be uploaded.
During processing, the import script calculates the values and compares the new data with the data already in the system. Where it finds differences, the stored data is updated with the values from the new dataset (a sketch of this update behaviour follows below).
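
The actual import runs as Scala/Spark code on EMR; the sketch below only illustrates, in Python, the update behaviour just described: each record of the new dataset is compared with the stored document and only the differing fields are written back. The index name, id field, and endpoint are assumptions for the example.

```python
# Illustrative compare-and-update logic (not the production Scala import job).
from elasticsearch import Elasticsearch, NotFoundError, helpers

es = Elasticsearch("https://els.example.internal:9200")  # placeholder ELS endpoint

def upsert_changed(records, index="contracts", id_field="contract_id"):
    """Compare each new record with the stored document and upsert only the differences."""
    actions = []
    for record in records:
        doc_id = record[id_field]
        try:
            current = es.get(index=index, id=doc_id)["_source"]
        except NotFoundError:
            current = {}
        # Keep only the fields whose values differ from what is already stored.
        changed = {k: v for k, v in record.items() if current.get(k) != v}
        if changed:
            actions.append({
                "_op_type": "update",
                "_index": index,
                "_id": doc_id,
                "doc": changed,
                "doc_as_upsert": True,  # create the document if it does not exist yet
            })
    if actions:
        helpers.bulk(es, actions)
```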

EMR Cluster

Repo with Scala code for the EMR cluster. The repo has a .gitlab-ci.yml file, which is responsible for building the Scala code and uploading it to the EMR cluster bucket.

Repo with Terraform code used to create the EMR cluster on demand for processing, along with upscaling ELS (for timely import of massive amounts of data from EMR) and downscaling it back when processing is finished. The Lambda function that fires the EMR import when a dataset is uploaded to the bucket is located in the python/lambda folder.
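
The on-demand scaling is owned by the Terraform code; the call below is only a rough boto3 equivalent of what an ELS upscale or downscale amounts to. The domain name, instance type, and instance counts are placeholders.

```python
# Rough illustration of resizing the Elasticsearch (ELS) domain around a large import.
import boto3

es_service = boto3.client("es")

def scale_els(domain_name="proact-els", instance_count=2,
              instance_type="r5.large.elasticsearch"):
    """Resize the Elasticsearch domain before a large EMR import (or back down afterwards)."""
    es_service.update_elasticsearch_domain_config(
        DomainName=domain_name,
        ElasticsearchClusterConfig={
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    )

# Upscale before a big import, downscale once processing has finished:
# scale_els(instance_count=4)
# scale_els(instance_count=2)
```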

An update to the Lambda code triggers deployment of the code to AWS.

Generic Workflow description
