System Engineer SRE - TSG Group

Wrocław

Ref. no.: 1103608

For our client, one of the biggest financial services companies, we are looking for Site Reliability Engineers.

We are searching for outstanding candidates who are passionate about building and running some of the largest and most complex software artifacts on the planet and who can quickly understand how something works, even if they have never seen it before. They strive to automate everything through code, and will help us make the journey through to a true no-ops model, identifying and automating end-to-end operational processes to deliver fast, efficient, and consistent results.

We look for the ability to solve problems with software, whether that’s been acquired from a textbook or at the school of hard knocks. Troubleshooting skills and the ability to unpack a problem into smaller pieces, identify possible causes, triage, and do so systematically are essential to this position. These skills could have been acquired through debugging code, operating a network, building hardware, or in other, entirely unrelated domains. However, the cognitive skills and approaches to problem-solving are subject-matter agnostic and critical to have, regardless of a candidate’s background.

An ideal Site Reliability Engineer will have a broad range of skills across a number of systems. They engineer services, and are adept at making changes to an environment safely. We also expect our SREs to be collaborative and inclusive to produce great results.

Our team is currently involved in the following:

Greenfield – designing and building a new platform with no technical debt
Multi-datacenter, multi-region application and infrastructure containerization and orchestration
OS and Image Build Automation
Hosting complex Apps in a containerized environment
Comprehensive postmortems
Scaling and building fault-tolerant systems
Secrets Delivery
Orchestrators
Chaos Testing
No-Ops
Registry
Automation, automation, automation!

Responsibilities:

Develop, engineer, and automate every operational process feasible to limit human interaction
Reduce human, manual toil through tooling and automation
Engineer products to be fault tolerant, resilient, and scalable
Be a part of a weekly, follow-the-sun, on-call rotation for their Site, providing 24x7 coverage
Respond quickly and resolve incidents in line with a Site’s SLOs and SLAs
Participate actively in Code Reviews
Write clear and concise documentation around procedures and processes
Actively participate in and deliver postmortems in line with our postmortem culture
Drive a blameless, inclusive, and collaborative environment across teams
Exhibit ownership of action items resulting from postmortems and follow them through to implementation and release
Investigate issues across multiple internal teams as well as external vendors
Uphold SDLC standards and release automation
Be deliberate and use data to make decisions, not intuition

Technical Qualifications:

Demonstrable hands-on experience with Linux, Docker and SRE or DevOps needed.
Experience with automation and configuration management – Ansible, Puppet or Salt.
Experience with fault and performance monitoring using tools like Prometheus, Grafana, Moogsoft or InfluxDB.
Experience with the architecture and implementation of PaaS software such as Cloud Foundry, Mesos / Marathon, or Kubernetes.
Deep understanding of Linux and OS Tuning.
Experience with troubleshooting complex problems and finding root cause in Linux systems.
Experience with virtualization technologies – KVM, VMware, etc.
Deep understanding of how to build fault-tolerant and scalable systems.
Experience with Python (preferred) or any other programming language.
Good understanding of Software Development Life Cycle, continuous integration and deployment, code reviews, testing, pipelines, git, Jenkins, etc.
Good understanding of building, deploying, and maintaining critical applications in a cloud-based environment.
Grasp of software engineering skills in modular design, data structures, algorithms, and UNIX systems development.

What we can offer you:

Challenging, fun and supportive environment within the Site Reliability Engineering world, a discipline developed by Google which has become the latest and greatest concept in management and automation of large disparate systems
Work with state-of-the-art enterprise and cutting-edge technologies like docker, mesos, marathon, nomad, packer, vault, ansible, salt, consul, terraform, nexus, artifactory, ci/cd, influxdb, grafana, prometheus, gitlab, jenkins, vmware, azure, rhel, illumio, tanium, veritas cluster, solaris, aix, cloudbolt, chaos monkey, openscap, and many more.
Highly competitive benefits package including pension and private medical cover

Please apply using the button on the right-hand side of this advertisement.

Hays Poland