Distributed Data Pipeline Engineer

Metaweb's data pipeline operations team is responsible for making sure that the cutting-edge algorithms written by our world-class semantic engineers are continuously (and consistently) populating Freebase with ever more topics and assertions. We need your help building out our pipeline architecture, integrating new algorithms, and writing monitoring frameworks to make sure everything is running smoothly. We've also got a number of data sources that need to be analyzed and mined.

Your efforts will help us grow Freebase into a compendium of all the world's knowledge!

We are looking for someone who has:

At Metaweb, you'll be:

Instructions

If this sounds like you, then please send us your resume in HTML or PDF format to jobs@metaweb.com.
Let yourself stand out from the crowd by sending us your thoughts on the following:
  1. A reliable data pipeline consists of more than just continuously running code that was successful for a single data load. What are some design strategies for a data pipeline that will increase reliability, auditability, and maintainability? Are there other important characteristics that a data pipeline should have?
  2. Complex data operations often produce output that cannot be automatically audited for correctness, because there is no "gold standard" to compare against (other than the algorithm itself). Discuss three strategies for ensuring that these algorithms are indeed running properly.
  3. As the number of data sources and algorithms added to a data pipeline increases, the chance of software system failure also increases. What failure modes are to be expected as the pipeline's complexity increases? How can they be prevented?
  • Principals only. Recruiters, please don't contact us about this job.
  • Please, no phone calls about this job.
  • Please do not contact us about other services, products or commercial interests.
  • Reposting this message elsewhere is OK.
