Job Description
Data Scientist | AI Startup | 3 Month Contract | £800/day
Company Overview
Join an innovative startup at the forefront of AI and cybersecurity, specialising in the development of advanced models for binary analysis and reverse engineering. The company is pioneering new techniques to push the boundaries of cybersecurity using AI-driven approaches.
Position Overview
We are seeking a forward-thinking Data Scientist with expertise in modern compiler frameworks to join our team on a contract basis. In this role, you will be instrumental in expanding a vast dataset of binaries, crucial for training cutting-edge AI models.
Responsibilities
- Work closely with the machine learning team to identify and establish data needs for model development.
- Design and implement automated workflows for large-scale data acquisition, compilation, and processing across multiple operating systems and architectures.
- Write scripts to scrape source code from repositories and package managers across various platforms.
- Automate the compilation process using a variety of compilers and configurations across different operating systems.
- Ensure data consistency and quality by standardising storage and processing techniques.
- Extract and label high-quality data from compiled binaries for supervised machine learning tasks.
- Maintain clear documentation for data collection and processing pipelines.
Experience
- Advanced degree in Computer Science, Data Science, or a related discipline.
- Expertise in programming languages such as C/C++ and Python.
- In-depth knowledge of modern compiler toolchains (e.g., GCC, LLVM/Clang, MSVC as shipped with Visual Studio).
- Experience managing complex software projects across multiple operating systems (Linux, Windows, macOS).
- Familiarity with cloud platforms such as AWS, GCP, or Azure.
- Proficiency with version control systems (e.g., Git) and containerisation tools (e.g., Docker).
Nice to Have
- Experience with binary analysis and reverse engineering.
- Knowledge of machine learning fundamentals.
- Experience working with massive datasets (terabyte-scale).