Hi! I'm Alex 👋

Site Reliability Engineer, o11y enthusiast, big fan of FOSS, ergo/mech/split keyboards and tech matters. Currently with the team at EPAM Systems, I have experience with both large enterprises and up-and-coming startups.

Click the images in the Skills and Experience sections for a more detailed description of what I’m doing/have done with the tools and companies listed.

For anyone wondering, this site is powered by Hugo, Cloudflare Pages and an S3 + Cloudflare Workers-powered CDN.

Ansible

Have written and maintained roles, collections, and open- and closed-source modules. Also created an Ansible development environment for multi-member teams, including third-party integrations such as:

  • Ansible Molecule
  • Ansible Ara
  • 1Password integration
  • Slack notification

Also made contributions to ansible-collections/community.grafana and provided my own modules to the community.

Python

Contributions to open-source projects:

Have created several smaller-scope closed-source production-level projects using Python, such as:

  • Google Cloud Functions
    • generate fiscal reports from DB (Cloud Scheduler, Cloud Functions, PubSub, Slack)
    • perform database maintenance using prepared statements (Cloud Scheduler, Cloud Functions, CloudSQL)
    • integration with third-party services (Cloud Functions, PubSub, RabbitMQ)
  • microservices (python & docker & skaffold & helm)
  • smaller system administration utilities

Go

My projects mostly revolve around the monitoring/observability spectrum (Prometheus exporters), with some serverless utilities and CLI tools written for integrations and other purposes.

Open-source contributions:

Monitoring

I’m a big fan of the OpenMetrics standard and believe it’s the way to go for the years to come, as more and more services are offering metrics directly in an OpenMetrics-compatible format.

I have years of on-call and incident management experience, and am comfortable with the following monitoring/observability tools:

  • Prometheus
  • Alertmanager
  • Pushgateway
  • VictoriaMetrics
  • VMAlert
  • Grafana
  • Grafana Loki
  • GCP Monitoring suite
  • PagerDuty
  • Datadog

Linux

Many years of being a Linux user both for fun and profit. I enjoy working in a terminal and always try to keep up to date with announcements from the kernel maintainers and the most common distros.

I have performed ‘traditional’ system administration on VMs (Ubuntu, Debian, RHEL), including but not limited to:

  • bootloader customisations
  • deploy, maintain and manage applications using systemd, supervisord, Docker
  • perform automated host bootstrapping (userspace configuration, RBAC, common application deployment) using Ansible
  • deploy monitoring tools at host level (such as node-exporter)
  • kernel tuning and custom kernel compilation (Debian)

I like Debian and its derivatives, although I can work equally well with RedHat-based systems. My homelab runs on Ubuntu.

GCP

I have experience designing architectures on GCP, as well as deploying and managing (through IaC tools) a wide range of GCP APIs:

  • Landing zone design
  • GCE (instances, instance groups)
  • Networking (VPC connectors, forwarding rules and settings)
  • Google Cloud Functions (have deployed several, both HTTP- and Pub/Sub-triggered; I prefer Pub/Sub)
  • Cloud Run (recently dived into it; a great, cheap alternative to running a full K8s cluster)
  • IAM & Admin (creating and managing service accounts, IAM bindings and interconnections with external products)
  • GKE
  • Cloud Scheduler
  • Cloud Storage

Terraform

Extensive experience with green- and brown-field projects and migrations.

Implemented landing zone design for applications running on Azure and GCP.

Created closed-source, opinionated Terraform modules for production use.

Assisted in TechOps process optimisation by:

  • moving developer on- and off-boarding to git-based operations using Terraform
  • improving development standards by provisioning Git repository presets, with optimisations built-in:
    • pre-commit
    • default CI pipelines
    • GitHub issue templates
    • Automatic CODEOWNERS assignment
    • README and application docs templates
    • Makefile presets with mostly environment-setting utilities
    • RenovateBot configuration

These presets allowed development teams to get up and running with new repositories in seconds instead of days.

GitHub-Actions

  • Have created and contributed to testing, deployment, maintenance and compliance pipelines, some of which are open-source (GHA reusability is great).

  • Have created self-hosted runner instances on GCP and AWS using Packer and the Ansible provisioner.

Kubernetes

  • Design, provision and maintain GKE and AKS clusters.
  • Workload provisioning using FluxCD
  • Set up tooling clusters, including the following open-source tools:
    • Harbor
    • Kyverno
    • Karpenter
    • GitHub Actions runner controller
    • Kubevela
    • Nginx ingress controller
    • CertManager
  • Implemented monitoring in Kubernetes clusters:
    • Grafana products (Mimir, Grafana Operator, Loki, Tempo)
    • Open-source products (Prometheus, VictoriaMetrics, Alertmanager, OpenTelemetry)
    • Datadog agent
    • Sentry

Azure

Designed, created and maintained Azure Landing Zones, including the related module configuration in Terraform. Azure APIs used:

  • subscription management
  • user management (Entra ID)
  • AKS
  • Cosmos DB
  • Azure Container Instances
  • ServiceBus
  • Blob Storage
  • Functions
  • Monitor

Compliance

  • Implement automated security/vulnerability scanning for container images using Trivy (ISO 27001:2022)
  • Perform automated misconfiguration analysis and policy adherence to internal style guide using Checkov (ISO 27001:2022)
  • Secret management tool (1Password) integration with GKE, Ansible
  • Use Google Compute Engine/OS Patch Management to automate package updates in a safe way

Skills

Ansible

Python

Go

Monitoring

Linux

GCP

Terraform

GitHub-Actions

Kubernetes

Azure

Compliance

Senior Systems Engineer (DevOps)

EPAM Systems

I joined EPAM Systems in September 2022, and so far I have worked with 2 high-profile European clients from the investment banking sector (one of them in Fortune Global 500). My responsibilities include:

  • Led the design and implementation of infrastructure migrations to public cloud platforms (GCP, Azure)
  • Implemented SLOs/SLIs in IaC
  • Automated onboarding for new users on development platforms
  • Streamlined the use of external container images by development teams
  • Assisted and tutored team members in technical troubleshooting
  • Migrated monolithic application to microservices/cloud-native pattern (Docker, Helm, FluxCD, Kubernetes)
  • Azure/GCP Landing Zone design and implementation (Terraform)
  • Automated compliance (vulnerability/security/policy scanning) using Terraform Sentinel, Trivy and Checkov
  • Developed CI pipelines in GitHub Actions and Bitbucket Pipelines (and code for the workflows in Python and Go)
  • Standardised VCS repository format (Terraform, GitHub)
    • streamlined issue reporting across all repos
    • created default compliance and reporting pipelines
    • increased developer adoption of testing and code-checking tools, since they came pre-configured with the app
  • Compliance in ISO 27001:{2013,2022}, SOC2
  • Design and implement observability stacks:
    • LGTM stack + Prometheus
    • Google Cloud Operations Suite
    • Azure Monitor

Measurable outcomes:

  • Achieved ~40% cost reduction in CI pipelines by developing self-hosted, self-registering runners on spot instances (Packer, Ansible, Terraform)
  • Achieved ~70% cost reduction for applications by optimising configuration:
    • spot instances
    • autoscaling
    • savings plan/pre-allocated instances
    • preemptible instances for development environments
    • automated shutdown of non-critical instances out of business hours
    • moved workloads from VMs to microservices/serverless, saving on compute costs

Site Reliability Engineer

The Remote Company

The Remote Company is a SaaS solution developer, specialising mostly in marketing/sales automation and web development. I joined the company when it had 42 employees and helped scale operations to 250 employees, serving over 1.5 million clients on a daily basis. Our flagship products were acquired by Vercom in 2022 for >90 million euros, the largest M&A deal in Eastern Europe for 2022.

Projects:

  • Configured observability platform from scratch for all products
    • Grafana
    • VictoriaMetrics
    • VMAgent
    • Loki
    • Alertmanager
    • Pushgateway
    • PagerDuty
    • Sentry
    • GCP Logging integration with Loki (using Pub/Sub)
  • Participated in planning and migration from third-party ISP to hyperscaler provider
  • Participated in design and drafting of SLA documents
  • Presented quarterly roadmap/progress reports to company stakeholders
  • Expanded configuration management and automated technical procedures
  • Developed serverless maintenance routines for CloudSQL instances (Google Cloud Functions, PubSub, Python, CloudSQL/PostgreSQL, SQL prepared statements) in coordination with DBRE
  • Developed serverless reporting utilities (Google Cloud Functions, Cloud Storage, PubSub, Python, CloudSQL/PostgreSQL, Slack API)
  • Participated in regular on-call rotation
  • Managed infrastructure in Terraform
  • Participated in planning, design, implementation, monitoring and maintenance of in-house SaaS products
  • Developed and deployed applications following the microservices and serverless models, in Python and Go
  • Developed and maintained the MailerSend Python SDK
  • Created CI pipelines to cut deployment and testing times
  • Assisted and tutored junior team members

DevOps/Tech Support Engineer

Pressidium

  • WordPress stack administration (Varnish, nginx, highly-available MySQL)
  • Developed troubleshooting utilities in Go/Python/Bash
  • Developed serverless WordPress vulnerability management automation using Python and WPVulnDB API (now WPScan)
  • Tutored and onboarded junior team members
  • Communicated with customer development teams
  • Developed Ansible roles and playbooks
  • Participated in on-call rotation
  • Configured database replication and monitoring using Percona XtraDB and PMM2
  • Deployed and maintained highly available cluster storage (DRBD)
  • Created documentation on several domains
    • System architecture docs
    • Troubleshooting runbooks
    • Troubleshooting tool documentation
    • Development style guides/best practices
    • Standard Operating Procedures (SOP)

DevOps Engineer (mil.service)

Hellenic Army

  • Automated web infrastructure (Ansible, HAProxy, apache2, MySQL) for Drupal and WordPress applications
  • Ansible custom module development
  • ISO 27001:2013 compliance
  • Implemented continuous security scanning of VM instances
  • Assisted developers with deployment operations
  • Automated report generation
  • Monitoring administration and alert development using Nagios
  • Participated in NATO Locked Shields 2018 cyber-security joint simulation exercise as Firewall Engineer for Greece’s Blue Team

Internet Operations Supervisor

Galanis Sports Data

  • Promoted from GFX & Statistical Software Operator
  • Coordinated in-field Operations teams nationwide
  • Streamlined data gathering from in-field operators
  • Developed data checking utilities in Python
  • Developed Java GUI program for use in televoting events (Epsilon TV)
  • Managed ISO 9001 compliance
  • Developed automated reports for use in live TV and statistical archives

Network and VoIP Engineer

Soldecom

  • Developed replicable VoIP system using Ansible and Asterisk PBX for satellite (VSAT) use in shipping vessels
  • Configured routers for use as Captive Portals using OpenWRT
  • Managed authentication with Kerberos
  • Automated physical server OS installation and configuration using PXE boot and Ansible
  • ISO 27001 compliance
  • Created documentation for the designed solutions

GFX & Stats Software Operator (PT)

Galanis Sports Data

  • Part-time/afternoon job (Wednesday/Thursday/Friday/Saturday/Sunday) gathering data for sports events
  • Operator for statistical software using domain-specific language for various sports
  • TV Graph generation engine operation
  • Documentation and Standard Operating Procedures development
  • Travel time: >80% (off-site in fields, courts, stadiums etc)

Experience

EPAM Systems

Senior Systems Engineer (DevOps)

The Remote Company

Site Reliability Engineer

Pressidium

DevOps/Tech Support Engineer

Hellenic Army

DevOps Engineer (mil.service)

Galanis Sports Data

Internet Operations Supervisor

Soldecom

Network and VoIP Engineer

Galanis Sports Data

GFX & Stats Software Operator (PT)

Education

Newcastle University

MSc Computer Security and Resilience

2018 - 2019

The Open University UK

BSc Information Technology

2013 - 2017

The American College of Greece

BSc Information Technology - Networking Technologies

2012 - 2017