About Us
We're building the future of uncensored AI infrastructure & products. Our technology powers hyper-immersive experiences and enables the ownership of personalized, interoperable AI characters, unlocking vast monetization opportunities across our ecosystem and beyond.
We are initially focused on the Creator and Social-Fi landscapes, building interoperable 'superModel' characters powered by our advanced proprietary multi-modal, uncensored AI models. These superModels can be experienced first on our platform, OhChat, with additional platform integrations in the works.
OhChat has gained 70,000 users across 174 countries in a matter of weeks. The site lets users enjoy hyper-immersive experiences with digital AI characters, enabling real-time, uncensored interactions with original characters as well as with 'digital twins' of celebrities and real-world creators, launched in partnership with them.
Website: https://chat.oh.xyz/
Job Overview
As a Data Engineer at Oh, you will play a crucial role in building and optimizing our data pipeline and infrastructure. You’ll be responsible for data collection, particularly large-scale image scraping, and managing structured and unstructured datasets for training generative AI models. You will work closely with machine learning engineers and developers to ensure data quality, availability, and scalability.
Key Responsibilities
- Data Pipeline Development: Design, build, and maintain data pipelines to support the collection, ingestion, and processing of large-scale image, video, and audio datasets.
- Data Scraping and Collection: Develop and optimize web scraping scripts to collect high-quality multimedia datasets (a minimal sketch of this kind of step follows this list).
- Data Storage and Management: Implement efficient storage solutions for large volumes of structured and unstructured data, ensuring data accessibility and scalability.
- ETL Processes: Develop and manage ETL processes to transform raw data into formats suitable for model training.
- Data Quality Assurance: Ensure data quality and consistency across different sources. Implement monitoring tools and workflows to maintain data accuracy and relevance.
- Documentation: Maintain clear documentation of data sources, scraping processes, and pipeline workflows for team reference and reproducibility.
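To make the scraping and ingestion responsibilities above concrete, here is a minimal sketch of that kind of step in Python. The gallery URL and landing directory are placeholders for illustration, not real project endpoints; a production version would also need rate limiting, robots.txt checks, and retries.

```python
import hashlib
from pathlib import Path
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

LANDING_DIR = Path("landing/images")  # placeholder landing zone (e.g., later synced to S3)

def scrape_image_urls(page_url: str) -> list[str]:
    """Collect absolute image URLs from a single gallery page."""
    resp = requests.get(page_url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]

def ingest(url: str) -> Path | None:
    """Download one image, deduplicate by content hash, and land it on disk."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    digest = hashlib.sha256(resp.content).hexdigest()
    target = LANDING_DIR / f"{digest}.jpg"
    if target.exists():  # content-hash filename doubles as a dedupe key
        return None
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    target.write_bytes(resp.content)
    return target

if __name__ == "__main__":
    for image_url in scrape_image_urls("https://example.com/gallery"):  # placeholder URL
        ingest(image_url)
```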
Required Skills & Qualifications
- Programming Languages: Proficiency in either Python or JavaScript for data scraping, ETL, and pipeline development.
- Web Scraping: Experience with web scraping tools and libraries (e.g., BeautifulSoup, Scrapy).
- Data Storage and Processing: Experience with SQL and NoSQL databases (e.g., PostgreSQL, MongoDB) and with cloud storage and warehousing (e.g., AWS S3, Redshift).
- Data Pipeline and Workflow Orchestration: Familiarity with data pipeline tools such as Apache Airflow, Prefect, or Luigi (a sketch using Airflow follows this list).
- Data Transformation: Strong knowledge of data transformation and processing techniques (e.g., Pandas or Dask in Python).
- Data Quality Control: Experience with data quality monitoring tools (e.g., dbt, Great Expectations).
- Version Control: Proficient with Git, as well as with data versioning tools (e.g., DVC).
- Pipeline Monitoring: Strong experience implementing and owning pipeline monitoring stacks (e.g., Sentry, Grafana, AWS CloudWatch).
- Testing and Code Quality: Extensive experience with common frameworks for unit, behavioural, integration, and end-to-end testing (e.g., Pytest, Behave, Postman) and with general code-quality tools and principles (e.g., Ruff, MyPy, Bandit, Black).
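As a rough illustration of how several of these tools fit together, here is a hedged sketch of an Airflow DAG chaining ingestion, a Pandas transform, and a lightweight quality gate. The DAG name, file paths, and column names are invented for illustration; the `schedule` keyword assumes Airflow 2.4+.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**_):
    ...  # stub: pull newly scraped records into a staging file

def transform(**_):
    # Normalize raw records into a training-ready layout (placeholder paths/columns).
    df = pd.read_json("staging/raw_records.jsonl", lines=True)
    df["caption"] = df["caption"].str.strip().str.lower()
    df.to_parquet("staging/clean_records.parquet")

def validate(**_):
    # Lightweight quality gate: fail the run rather than train on bad data.
    df = pd.read_parquet("staging/clean_records.parquet")
    assert df["caption"].notna().all(), "null captions reached the clean layer"
    assert not df["image_sha256"].duplicated().any(), "duplicate images detected"

with DAG(
    dag_id="media_ingest",      # invented DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",          # Airflow 2.4+ keyword
    catchup=False,
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    extract_task >> transform_task >> validate_task
```

In practice the quality gate would more likely live in dbt tests or Great Expectations suites; the inline assertions here just show where such a check sits in the flow.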
Preferred Qualifications
- Experience in Generative AI Data Collection: Understanding of the types of data needed for training generative AI models (e.g., GANs, LLMs, diffusion models).
- Knowledge of ML/DL Basics: Familiarity with machine learning concepts, particularly around data needs for training and evaluation in the context of generative models.
- Familiarity with Blockchain: Not mandatory, but a keen interest in the blockchain ecosystem and its data sources is an advantage.
- Data Governance: Understanding of legal and ethical implications of data collection, including copyright and privacy concerns.
- Experience with Image and Video Processing: Familiarity with libraries for image processing (e.g., OpenCV, PIL) and video data handling is a plus (see the sketch after this list).
- Big Data Experience: Familiarity with big data tools and frameworks (e.g., Spark, Hadoop) is a plus.
- DevOps: Some experience with common DevOps tools (e.g., CI/CD pipelines, Terraform/CDK, Docker) and best practices is a bonus.
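For the image-processing point above, a small hedged example of the pre-training hygiene involved: filtering out corrupt or undersized files with PIL before they enter a training set. The 256-pixel floor and the `landing/images` directory are illustrative assumptions.

```python
from pathlib import Path

from PIL import Image

MIN_SIDE = 256  # illustrative minimum resolution for training images

def is_trainable(path: Path) -> bool:
    """Return True if the file decodes cleanly and meets the size floor."""
    try:
        with Image.open(path) as img:
            img.verify()               # cheap integrity check, no full decode
        with Image.open(path) as img:  # reopen: verify() invalidates the handle
            return min(img.size) >= MIN_SIDE
    except Exception:                  # truncated or corrupt files raise here
        return False

usable = [p for p in Path("landing/images").glob("*.jpg") if is_trainable(p)]
```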
As part of our team, you’ll enjoy:
- The hustle of a startup with the impact of a global business
- Tremendous opportunity to join a business pioneering the future of AI
- Working with an extraordinary team of smart, creative, fun and highly motivated people
- Flexible working hours, including remote working
- Modern, uplifting work environment
- Pension scheme
- Generous starting salary