The mission of the SRE (Site Reliability Engineering) team is to ensure the efficient and sustainable operation of Shopee 24x7, and to build and maintain large-scale, highly available, high-performance distributed systems with a focus on system availability and performance.
The team combines traditional software engineering with technical operations. The SRE team needs to dive deep into the Shopee development lines to ensure that the system remains highly scalable as it evolves rapidly.
From the perspective of stability and performance, the team's scope includes the design of business applications, components of the basic platform (middleware, container scheduling, caching, object storage, etc.), OS optimization, and data center and network optimization.
We use engineering and service-oriented approaches to streamline the inefficient and complicated procedures of the traditional operation and maintenance model, and we are committed to building a sound monitoring system to improve the efficiency of incident handling.
Job Description:
Dive deep into development lines, learn and understand the mechanisms of every application component, and promote product scalability, stability and performance
Set up, manage and maintain Shopee product / middleware / big-data applications and services
Perform regular and ad-hoc server-side deployments, performance fine-tuning and troubleshooting
Design and develop automated technical operations platforms
Manage capacity and resources
Take responsibility for full-chain stress testing to improve application performance and remove redundancy
Prepare routine operation documentation
Requirements:
Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields
Candidates with less than 1 year of experience are welcome
Extensive hands-on knowledge of Linux operating systems (Ubuntu, CentOS, etc.)
Knowledge of computer networks (TCP/IP, DNS, etc.), computer organisation and operating systems
Hands-on experience with at least one of the following programming languages: Bash, Python, Go
Strong analytical and problem-solving skills with the ability to thrive in difficult and stressful situations
Passion for the work and a strong sense of responsibility
Fast learner and good team player
Detail-oriented, cautious and prudent
Open to fresh graduates who are passionate about technical operations of internet products, Linux OS and open source
Experience with automation tools like Ansible, SaltStack (Preferred)
Experience with monitoring tools like Prometheus, Zabbix, Grafana, etc. (Preferred)
Experience with load balancing tools like LVS, Nginx, OpenResty or HAProxy (Preferred)
Experience with container technology such as Docker, Kubernetes (Preferred)
Experience with high-availability system design and server deployment processes (Preferred)
Experience with SRE (Preferred)
Experience with an Ops PaaS platform or Ops automation platform (e.g. CMDB) (Preferred)