ProAct Analytical Platform
Design & Development from Scratch
At the start of the discovery phase we had a presentation, Excel sheets describing how the indicators should be calculated, and a preliminary structure of the datasets.
As a result, our team of six specialists, together with analysts and methodologists from the Government Transparency Institute and consultants from the World Bank, developed the ProAct Platform: a set of analytical tools that provide access to open data from the national electronic procurement systems of 46 countries and to open data on World Bank- and IDB-financed contracts for over 100 countries.
About our client
The prototype has been developed for the World Bank in collaboration with the Government Transparency Institute and the Centre for the Study of Corruption at the University of Sussex. The methodology for the platform is based in part on the methodology developed for the www.opentender.eu project and under the Global Integrity Anticorruption Evidence (GI-ACE) program funded by the UK’s FCDO. The ProACT platform benefits from data collected under these projects.
Our approach
The MK team, in close cooperation with stakeholders, helped the client develop the prototype in a framework of moving goals: forming user stories into features, proposing technical approaches and solutions, and resolving the customer’s requests on the fly.
Contractor
The World Bank
Collaborators
Government Transparency Institute
Delivery period
Prototype: January 2020 – June 2021
Modifications: May 2022 – July 2022
Goal/Business challenge
- Provide easy access to global available public procurement data from 48 datasets.
- Enable users to identify, analyze and monitor public procurement performance and integrity risks to inform preventive actions, process improvements, and policy reform.
- Help prevent corruption and promote transparency and integrity in public expenditures on goods, works, and services.
Results
Value delivered:
- optimal solutions fitting the customer’s budget and timeframes;
- reduced AWS maintenance costs and automated dataset uploads;
- fast performance, customization, and scalability of the product.
Services
Services provided:
- Architecture development
- Infrastructure development on AWS
- Front- and Back-end code development
- DevOps optimization and support
The project consists of the following parts:
- Webserver VM, which runs the frontend and backend servers;
- ELS – Elasticsearch with processed data (indexes of buyers, suppliers, etc.);
- On-demand EMR cluster for dataset processing, with results stored in ELS;
- S3 buckets: the EMR cluster bucket stores the EMR processing code (Scala code for Spark); the datasets bucket stores the raw datasets;
- Lambda function, which fires EMR when a new dataset is uploaded to the S3 bucket.
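The last two parts can be sketched as a minimal Lambda handler. All specifics here (bucket names, instance types, JAR path, the `proact-import` cluster name) are illustrative assumptions, not the project’s actual configuration:

```python
import urllib.parse


def dataset_key_from_event(event):
    """Extract the uploaded object's S3 key from an S3 PUT event."""
    record = event["Records"][0]
    return urllib.parse.unquote_plus(record["s3"]["object"]["key"])


def lambda_handler(event, context):
    """Fire an on-demand EMR cluster when a new dataset lands in S3."""
    import boto3  # AWS SDK, available in the Lambda runtime

    key = dataset_key_from_event(event)
    emr = boto3.client("emr")
    # Hypothetical job-flow parameters; the real ones live in the Terraform repo.
    response = emr.run_job_flow(
        Name=f"proact-import-{key}",
        ReleaseLabel="emr-6.2.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when done
        },
        Steps=[{
            "Name": "process-dataset",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Spark job built from the Scala repo, stored in the EMR cluster bucket
                "Args": ["spark-submit",
                         "s3://emr-cluster-bucket/processing.jar",
                         f"s3://datasets-bucket/{key}"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"cluster_id": response["JobFlowId"]}
```

Because `KeepJobFlowAliveWhenNoSteps` is false, the cluster exists only for the duration of one import, which is what keeps the AWS costs low.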
Datasets Upload
The CI/CD pipeline is organized so that uploading a new dataset (or several datasets) to the root of the S3 folder triggers a script and the dataset(s) start to be processed. During this process the following steps are performed, with a confirmation posted to a Slack channel for every action:
- Check the file structure to confirm it is a dataset.
- Read, transform, and normalize the input data.
- Check whether Elasticsearch is up.
- Notify if the cluster is not ready to scale.
- Check whether Elasticsearch is ready to upscale.
- Start the import with the CI/CD pipeline.
- Start the Elasticsearch upscale.
- Notify when the dataset import finishes.
- Notify when Elasticsearch stops.
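The per-step confirmations above can be posted with a small helper like the following. The webhook URL and message wording are illustrative assumptions; the real webhook is a workspace secret:

```python
import json
import urllib.request

# Hypothetical incoming-webhook URL for the pipeline's Slack channel.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"


def format_step_message(dataset, step, status):
    """Build the one-line confirmation posted to the Slack channel."""
    return f"[{dataset}] {step}: {status}"


def notify_slack(dataset, step, status):
    """POST the confirmation to the channel's incoming webhook."""
    payload = json.dumps({"text": format_step_message(dataset, step, status)})
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

For example, `notify_slack("hungary_2021.csv", "Elasticsearch upscale", "started")` would post `[hungary_2021.csv] Elasticsearch upscale: started` to the channel.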
When several datasets are uploaded together, they are queued for processing. If there is a queue, or a dataset is large, several EMR clusters are built; in this case the user will see the corresponding notifications in Slack.
Uploading a new dataset for a country that already has an old dataset in the system does not require the operator to delete the old one; it is enough to upload the new dataset.
During processing, the import script calculates the values and compares the new data with the data already in the system. Where it finds differences, the records are updated with values taken from the new dataset.
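The compare-and-update step can be sketched as a pure function: records are keyed by a hypothetical `id` field, and only new or changed records are re-imported. This illustrates the idea, not the script’s actual logic:

```python
def records_to_update(existing, incoming):
    """Return the incoming records that are new or differ from what is stored.

    `existing` maps record id -> record dict (data already in the system);
    `incoming` is the list of record dicts parsed from the new dataset.
    """
    updates = []
    for rec in incoming:
        if existing.get(rec["id"]) != rec:
            updates.append(rec)  # new record, or a value changed
    return updates
```

Unchanged records are skipped, so re-uploading a mostly identical dataset touches only the rows that actually differ.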
EMR Cluster
A repo with the Scala code for the EMR cluster. The repo has a .gitlab-ci.yml file, which is responsible for building the Scala code and uploading it to the EMR cluster bucket.
A repo with the Terraform code used to create the EMR cluster on demand for processing, to upscale ELS (for timely import of massive amounts of data from EMR), and to downscale it back when processing is finished. The Lambda function that fires the EMR import when a dataset is uploaded to the bucket is located in the python/lambda folder.
An update to the Lambda code fires a deployment of the code to AWS.
Generic Workflow description