Metaweb's data pipeline operations team is responsible for making sure that the cutting-edge algorithms written by our world-class semantic engineers are continuously (and consistently) populating Freebase with ever more topics and assertions. We need your help in building out our pipeline architecture, integrating new algorithms, and writing monitoring frameworks to make sure everything is running smoothly. We've also got a number of data sources that need to be analyzed and mined.
Your efforts will help us grow Freebase into a compendium of all the world's knowledge!
We are looking for someone who has:
- Experience with large-scale operational data processes, such as news processing, sentiment analysis, preference matching, data-centric web applications, and/or web indexing operations
- Experience using and administering parallel machine clusters and distributed file systems. The ideal candidate will have previous experience implementing data processing algorithms in a MapReduce framework such as Hadoop
- Experience with system administration and server configuration management
- A passion for process, detail, and quality
- Experience building real-time monitoring and test frameworks to assure operational quality
- An understanding of the difference between production code and quick scripts, and the comfort to create either when required
At Metaweb, you'll be:
- Architecting stable, scalable data processing frameworks, including job scheduling and execution frameworks
- Managing a Hadoop cluster and building software to help manage and monitor the cluster
- Designing solutions for problems of massive scale
- Working with the best, brightest, and funnest people in the industry
Instructions
If this sounds like you, then please send us your resume in HTML or PDF format to jobs@metaweb.com. Let yourself stand out from the crowd by sending us your thoughts on the following:
- A reliable data pipeline consists of more than just continuously running code that was successful for a single data load. What are some design strategies for a data pipeline that will increase reliability, auditability, and maintainability? Are there other important characteristics that a data pipeline should have?
- Complex data operations often produce output that cannot be automatically audited for correctness, because there is no "gold standard" to compare to (other than the algorithm itself). Discuss three strategies for assuring that these algorithms are indeed running properly.
- As the number of data sources and algorithms added to a data pipeline increases, the chance of software system failure also increases. What failure modes are to be expected as the pipeline's complexity increases? How can they be prevented?
Metaweb is an Equal Opportunity Employer and does not unlawfully discriminate on the basis of any status or condition protected by applicable federal or state law.
- Principals only. Recruiters, please don't contact us about this job.
- Please, no phone calls about this job.
- Please do not contact us about other services, products or commercial interests.
- Reposting this message elsewhere is OK.
