Job Summary
The role is accountable for tactical and operational support of production services across one or more specific platforms or domains.
- Ensure maximum service quality and stability through fast and effective response to technical incidents, and act as a catalyst for change by analysing incidents and identifying continual service improvement opportunities. Depending on the area of technical specialisation, the role may also act in a control capacity, in addition to incident resolution and prevention, to ensure that new changes to the technology estate do not introduce instability.
- Manage technical resumption for high-priority, S@R, and medium/high-severity incidents; provide end-to-end support and resolve incidents within SLA.
- Provide root cause analysis for S@R and medium/high-severity issues, and ensure all follow-up action points are carried out.
- Responsible for the stability of the production system. Direct second- and third-level support for problem diagnosis and resolution in line with the agreed SLAs.
- Responsible for managing production-related changes, releases and rollouts with zero or minimal impact on the stability of the application. Review dependent changes in surrounding systems, infrastructure, networking, etc. Ensure proper technical plans are in place for all production changes (e.g. fallback plan, implementation plan, data conversion).
- Create and update Production Support documentation, contingency (DR/BCP) documentation and processes.
- Provide inputs to the PSS manager for the monthly dashboard, which reports incident and problem trends along with SIP and RCA action items.
- Participate in and support cross-training and knowledge transfer activities within support teams.
Key Responsibilities
Service stability and incident management
• Ensure maximum service quality and stability through prompt and effective response to technical incidents.
• Act as a catalyst for change by performing incident and problem analysis, identifying root causes, and driving continual service improvement (CSI) initiatives.
• Where relevant, perform a control function to ensure that new technology changes do not introduce instability into the production environment.
Monitoring and observability
• Drive and achieve “north star” monitoring and observability goals.
• Build comprehensive monitoring, alerting, and logging for critical services, enabling proactive detection and rapid remediation of issues.
Automation and operational excellence
• Automate operational tasks such as deployments, monitoring, scaling, and infrastructure management to reduce manual effort and operational risk.
Site Reliability Engineering (SRE) practices
• Troubleshoot issues and participate in incident response and post-incident reviews (post-mortems) to minimise downtime and institutionalise learning from failures.
• Optimise infrastructure, systems, and processes for performance, efficiency, and reliability.
• Contribute to the design and implementation of robust deployment pipelines and release strategies that enable smooth, frequent, and reliable releases (e.g. blue/green, canary).
Change, release, and rollout management
• Review and implement production-related changes, releases, and rollouts with zero or minimal impact on application stability and client experience.
• Review and coordinate dependent changes across surrounding systems, infrastructure, networks, and shared services.
• Ensure thorough technical plans are in place for all production changes, including implementation steps, fallback/rollback strategies, data conversion or migration plans, and validation checks.
Reporting and continuous improvement
• Drive closure of remediation actions to prevent recurrence of incidents.
Collaboration, coaching, and knowledge sharing
• Participate in and support cross-training and structured knowledge transfer activities within and across support and engineering teams.
Leverage AI and automation for production engineering
• Use AI-driven tools (e.g. for log analysis, anomaly detection, alert correlation, and capacity forecasting) to proactively identify, diagnose, and resolve production issues.
• Collaborate with engineering and platform teams to integrate AI/ML capabilities into monitoring, incident management, and self-healing workflows (e.g. automated remediation, intelligent runbooks).
• Continuously review and refine AI-enabled alerts, models, and automations based on production behaviour, incident learnings, and feedback from support teams.
• Promote the safe and compliant adoption of AI solutions within production engineering, ensuring adherence to the bank’s risk, security, and data governance standards.
Governance
- Provide inputs to management for the monthly dashboard, which reports incident and problem trends along with SIP and RCA action items.
Regulatory & Business Conduct
- Display exemplary conduct and live by the Group’s Values and Code of Conduct.
- Take personal responsibility for embedding the highest standards of ethics, including regulatory and business conduct, across Standard Chartered Bank. This includes understanding and ensuring compliance with, in letter and spirit, all applicable laws, regulations, guidelines and the Group Code of Conduct.
- Lead the [team] to achieve the outcomes set out in the Bank’s Conduct Principles: [Fair Outcomes for Clients; Effective Financial Markets; Financial Crime Compliance; The Right Environment.]
- Effectively and collaboratively identify, escalate, mitigate and resolve risk, conduct and compliance matters.
Key stakeholders
- Production Engineering Chapter Area Lead
- Production Engineering Chapter Lead
- Wealth Management - Product Owners
- Country Technology Management
- Technical Service Engineering team
Skills and Experience
- Minimum 4 years of experience in application production support and stability as part of an SRE function
- Java / J2EE, Spring MVC, Spring Boot and Hibernate are mandatory
- Any one of the following databases: DB2 / Oracle / PostgreSQL
- Linux is a must-have
- Kubernetes and AWS are mandatory
- Tomcat / JBoss is a must-have skill
- Proficient user experience with Elasticsearch
- Proficient user experience with Grafana
- AI tools: GitHub Copilot
- Willingness to work rotational shifts, including night shifts (24x7 support)
Qualifications
- Any degree in computer science or an equivalent field
Good to Have
- AWS Certification
- SRE Certification
- ITIL Certification
About Standard Chartered
We're an international bank, nimble enough to act, big enough for impact. For more than 170 years, we've worked to make a positive difference for our clients, communities, and each other. We question the status quo, love a challenge and enjoy finding new opportunities to grow and do better than before. If you're looking for a career with purpose and you want to work for a bank making a difference, we want to hear from you. You can count on us to celebrate your unique talents and we can't wait to see the talents you can bring us.
Our purpose, to drive commerce and prosperity through our unique diversity, together with our brand promise, to be here for good, are achieved by how we each live our valued behaviours. When you work with us, you'll see how we value difference and advocate inclusion.
Together we:
- Do the right thing and are assertive, challenge one another, and live with integrity, while putting the client at the heart of what we do
- Never settle, continuously striving to improve and innovate, keeping things simple and learning from doing well, and not so well
- Are better together, we can be ourselves, be inclusive, see more good in others, and work collectively to build for the long term
What we offer
In line with our Fair Pay Charter, we offer a competitive salary and benefits to support your mental, physical, financial and social wellbeing.
- Core bank funding for retirement savings, medical and life insurance, with flexible and voluntary benefits available in some locations.
- Time off including annual leave, parental/maternity leave (20 weeks), sabbatical (12 months maximum) and volunteering leave (3 days), along with minimum global standards for annual leave and public holidays, which combine to a minimum of 30 days.
- Flexible working options based around home and office locations, with flexible working patterns.
- Proactive wellbeing support through Unmind, a market-leading digital wellbeing platform, development courses for resilience and other human skills, a global Employee Assistance Programme, sick leave, mental health first-aiders and a range of self-help toolkits.
- A continuous learning culture to support your growth, with opportunities to reskill and upskill and access to physical, virtual and digital learning.
- Being part of an inclusive and values-driven organisation, one that embraces and celebrates our unique diversity across our teams, business functions and geographies, where everyone feels respected and can realise their full potential.