Site Reliability Engineer / Platform Engineer Job Details

Site Reliability Engineer / Platform Engineer

Job Description

Requisition Number: 49061

Job Location: Bukit Jalil KL, MYS

Global Grade: Band 6

Work Type: Office Working

Employment Type: Permanent

Posting Start Date: 11/03/2026

Posting End Date: 30/04/2026

We are looking for a Site Reliability Engineer (SRE) / DevOps Engineer / Platform Engineer to ensure the reliability, availability, performance, and efficiency of our production systems and services. The role involves collaborating closely with development, infrastructure, and support teams to build robust, scalable, and observable platforms. The responsibilities and skills we expect are set out below.

Our Ideal Candidates Should Have:

Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
At least 5 years of proven experience in the technical areas listed below.
Mandarin language proficiency is an added advantage.

Role Overview

The successful candidate will be responsible for ensuring the reliability, availability, performance, and scalability of critical applications and infrastructure. This includes designing and supporting application and system architectures, managing cloud and container platforms, implementing observability, and supporting CI/CD pipelines. The role involves close collaboration with development, infrastructure, and security teams to support production and non-production environments.

Core Technical Skills

Application & System Architecture

Strong understanding of distributed systems and modern application architectures (e.g., microservices, event-driven, API-based).
Knowledge of load balancing, reverse proxies, and API gateways (e.g., Nginx, HAProxy, AWS ALB/NLB, API Gateway, Kong, Istio).
Familiarity with relational and/or NoSQL databases (e.g., PostgreSQL, MySQL, Oracle, MongoDB, Redis), including basic performance tuning and high availability concepts (replication, clustering).

Operating Systems

Hands-on experience administering and troubleshooting Linux/Unix systems (e.g., RHEL, CentOS, Ubuntu).
Proficiency with system-level tools (systemd, journalctl, netstat/ss, top/htop, iostat, vmstat).
Ability to analyse system logs and perform root cause analysis for performance and stability issues.

Cloud Platforms

Experience with AWS as the cloud provider.

Hands-on work with core services such as compute (EC2/VMs), storage (S3/Blob/Cloud Storage), networking (VPC/VNet, security groups/NSGs), and managed databases where applicable.
Understanding of cloud networking (subnets, routing, security groups, load balancers) and IAM/role-based access control.

Observability & Monitoring

Experience implementing and managing monitoring/observability solutions such as:
Prometheus & Grafana for metrics and dashboards
ELK/Elastic Stack (Elasticsearch, Logstash, Kibana) or OpenSearch stack
Splunk for centralized logging and alerting
Cloud-native tools (e.g., CloudWatch, Azure Monitor, GCP Cloud Monitoring & Logging)
Ability to define key metrics (SLIs/SLOs), set up alerts, and build dashboards to monitor application and infrastructure health.
Experience in capacity planning and performance tuning based on monitoring data.

CI/CD & Automation

Familiarity with CI/CD tools and practices, such as:
Jenkins
GitLab CI
GitHub Actions
Azure DevOps Pipelines
Ability to:
Design, configure, and maintain build and deployment pipelines.
Implement automated unit/integration tests within pipelines.
Integrate security and compliance checks (SAST/DAST, dependency scanning).
Knowledge of infrastructure-as-code is a plus (e.g., Terraform, CloudFormation, Ansible).

Containers & Orchestration

Practical experience building, running, and troubleshooting Docker containers.
Experience with Kubernetes (on-prem or managed services like EKS/AKS/GKE), including:
Deployments, Services, Ingress, ConfigMaps, Secrets.
Health checks, autoscaling (HPA), resource limits/requests.
Basic understanding of cluster operations, namespaces, RBAC, and monitoring within K8s.

Scripting & Programming

Proficiency in at least one scripting or programming language, such as:
Python (preferred)
Shell scripting (bash/sh)
Go (nice to have)
Ability to automate operational tasks, write utilities/tools, and integrate with APIs (REST/JSON).
Experience with version control (Git) and collaborative workflows (branching, pull requests).

Incident Management & Reliability

Strong problem-solving skills with experience handling critical production incidents.
Ability to perform incident triage, identify ownership (app vs infra vs network), and coordinate resolution with multiple teams.
Experience with incident management and on-call practices (e.g., runbooks, escalation paths, post-incident reviews).
Familiarity with ServiceNow or similar incident/ticketing systems is an advantage.

Soft Skills & Ways of Working

Good communication skills (written and verbal) with the ability to explain technical issues to both technical and non-technical stakeholders.
Proven experience collaborating with cross-functional teams (development, QA, security, operations, product).
Ability to work in agile environments (Scrum/Kanban), participate in ceremonies, and contribute to continuous improvement.
Strong ownership mindset, attention to detail, and ability to operate under pressure during high-severity incidents.

Preferred / Nice-to-Have

Experience in regulated or enterprise environments (e.g., banking/financial services) with compliance and security considerations.
Knowledge of security best practices for cloud, containers, and CI/CD (e.g., secrets management, least privilege, vulnerability management).

RESPONSIBILITIES

Resiliency
• Enhance application service and infrastructure resilience through self-healing and automated failover, targeting 99.99% uptime for customers.
• Assist in running planned, randomised disruptions of production infrastructure to ensure accountability for building resilient, always-on systems.
• Build resilience into the application so underlying system failures are handled gracefully and do not impact end users. Influence design/development teams to always be thinking of the rainy-day scenarios.
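
To put the 99.99% target in concrete terms, the following minimal Python sketch converts an availability percentage into a downtime allowance. The function name and the 30-day and 365-day periods are illustrative assumptions, not part of the role definition.

```python
# Minimal sketch: downtime permitted by an availability target.
# The function name and the example periods are illustrative assumptions.

def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime permitted over period_days."""
    total_minutes = period_days * 24 * 60
    # The unavailable fraction is whatever the target leaves over.
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, days in [("30-day month", 30), ("365-day year", 365)]:
        minutes = downtime_budget_minutes(99.99, days)
        print(f"{label}: {minutes:.2f} minutes of allowed downtime")
```

Under these assumptions, a 99.99% target leaves roughly 4.32 minutes of downtime in a 30-day month and roughly 52.56 minutes in a 365-day year.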

Efficiency
• Identify opportunities to eliminate all manual and repeatable activities (toil) via tooling and automation
• Reduce the number of repeat incidents by permanently fixing the underlying root cause of issues

Capacity Planning
• Enhance application and infrastructure scalability via iterative capacity management with the goal of reducing the effort required for capacity reviews through deep monitoring and auto-scale properties.
• Continuously monitor capacity for any discrepancies or spikes

Business
Availability/Reliability/Performance
• Design, code, and implement break fixes to improve service availability based on the outcomes of thematic reviews
• Participate in post-mortem reviews, helping to ensure each exercise is a blameless “adjust” opportunity
• Monitor SLIs/SLOs in partnership with Product Teams to achieve the optimal development velocity

Processes
Monitoring
• Optimize monitoring to reduce false positive alerts
• Creatively deepen monitoring capabilities leveraging the 3 tenets of observability – logs, metrics and traces
• Ensure all critical user service journeys are traceable end to end
• Ensure production solutions are fit for purpose. Where gaps are identified, put a plan in place to uplift the toolset

People and Talent
• Lead by example and build the appropriate culture and values. Set the appropriate tone and expectations in the team and work in collaboration with other team members.

Risk Management
• Identify key issues in the business areas being supported, and based on this information, put in place appropriate controls and measures to assess, monitor, control & mitigate risks.
• Ensure a full understanding of the risk and control environment within Technology Services.
• Ensure support procedures are in place and adhere to Group Security & Audit policies within Technology Services.
• Engage actively with all audit issues arising in this support environment.

Governance
• Responsible for assessing the effectiveness of the Group’s arrangements to deliver effective governance, oversight and controls in the business and, if necessary, oversee changes in these areas
• Awareness and understanding of the regulatory framework, in which the Group operates, and the regulatory requirements and expectations relevant to the role.
• Responsible for delivering ‘effective governance’; capability to challenge fellow executives effectively; and willingness to work with any local regulators in an open and cooperative manner.

Regulatory & Business Conduct
• Display exemplary conduct and live by the Group’s Values and Code of Conduct.
• Take personal responsibility for embedding the highest standards of ethics, including regulatory and business conduct, across Standard Chartered Bank. This includes understanding and ensuring compliance with, in letter and spirit, all applicable laws, regulations, guidelines and the Group Code of Conduct.
• Lead the Production Engineering team to achieve the outcomes set out in the Bank’s Conduct Principles: Fair Outcomes for Clients; Effective Financial Markets; Financial Crime Compliance; The Right Environment.
• Effectively and collaboratively identify, escalate, mitigate and resolve risk, conduct and compliance matters.

Key Stakeholders
• Business Heads in the country and the group
• Domain Heads in Tech Services
• Country CIO and CTM
• Business CIO

About Standard Chartered

We're an international bank, nimble enough to act, big enough for impact. For more than 170 years, we've worked to make a positive difference for our clients, communities, and each other. We question the status quo, love a challenge and enjoy finding new opportunities to grow and do better than before. If you're looking for a career with purpose and you want to work for a bank making a difference, we want to hear from you. You can count on us to celebrate your unique talents and we can't wait to see the talents you can bring us.

Our purpose, to drive commerce and prosperity through our unique diversity, together with our brand promise, to be here for good are achieved by how we each live our valued behaviours. When you work with us, you'll see how we value difference and advocate inclusion.

Together we:

Do the right thing and are assertive, challenge one another, and live with integrity, while putting the client at the heart of what we do
Never settle, continuously striving to improve and innovate, keeping things simple and learning from doing well, and not so well
Are better together, we can be ourselves, be inclusive, see more good in others, and work collectively to build for the long term

What we offer

In line with our Fair Pay Charter, we offer a competitive salary and benefits to support your mental, physical, financial and social wellbeing.

Core bank funding for retirement savings, medical and life insurance, with flexible and voluntary benefits available in some locations.
Time off including annual leave, parental/maternity leave (20 weeks), sabbatical (12 months maximum) and volunteering leave (3 days), along with minimum global standards for annual leave and public holidays, which combine to a minimum of 30 days.
Flexible working options based around home and office locations, with flexible working patterns.
Proactive wellbeing support through Unmind, a market-leading digital wellbeing platform, development courses for resilience and other human skills, a global Employee Assistance Programme, sick leave, mental health first-aiders and a range of self-help toolkits.
A continuous learning culture to support your growth, with opportunities to reskill and upskill and access to physical, virtual and digital learning.
Being part of an inclusive and values driven organisation, one that embraces and celebrates our unique diversity, across our teams, business functions and geographies - everyone feels respected and can realise their full potential.
