At OSCON 2008, Mike Hendrickson interviewed Jason Hunter about MarkMail.org, a site that archives 34 million email messages from 6,470 open source mailing lists. Mike asks Jason about the technology behind MarkMail.org and how MarkLogic's products can scale to handle petabyte-scale data.
Mike Hendrickson: Hey, we're here with Jason Hunter at OSCON 2008. Welcome to Oregon. We have some questions for Jason about big data, how you scale data, and how Mark Logic and Mark Mail fit into that whole scheme.
So, Jason, tell me. How much mail can Mark Mail handle? I've heard of terabyte and petabyte scale data. Can you guys do that?
Jason Hunter: Well, right now, I have a terabyte with 17 million mails. That's - for every mail that has an attachment, we're storing the attachment, we're storing PDFs and they'll be tests done on the images, so the actual XML size is smaller than that. I plan to go much larger than that. I plan to go - I know that there's at least 100 million really good open source messages out there, open source mail. I think we should probably say what Mark Mail is.
Mark Mail is a - and markmail.org - is a website for searching and visualizing open source emails. It's designed to make it easier to find answers, easier to identify trends, easier to see who's an expert on things; to decide if something's increasing in popularity, decreasing, stuff like that. And so, we have launched it and we have a lot of open source mails in it, presently 16.7 million. It's growing every day.
And it's not stressing us out at all right now at that size; we expect it'll grow bigger. I know, I did a back of the envelope of about 100 million really good open source messages, but even beyond that, when there aren't open source messages to load, I think that we can get to a billion easy.
Mike Hendrickson: And you can handle that?
Jason Hunter: We have a customer that has 200 terabytes of XML type, and so if you scale that out, I can do that; and probably every public email message that's ever been done still fits in that size if you _____ to XML.
Mike Hendrickson: So, Jason, tell us what technologies you use at Mark Logic?
Jason Hunter: Alright, what tech? Part of the language is we write the site in XQuery, which is a W3C standard language; it's the native language of Mark Logic's server on which we're built and it's this XML-aware language and it works really well for us because all of our messages are XML. We convert email into an XML format and we use XQuery then to process them. We can search them, manipulate them, render them and we do all of that in this XML-aware language. It's not just that: we use Java to convert messages into the XML; we use Perl to acquire the messages a lot of times. We use JavaScript on the front end to make it the single-page, you know, one-page web app experience, so that every time you're taking an action, we don't have to refresh the page. It's all that embedded thing. Then we have an F5 load balancer that we program in Tcl. The core server is written in C++. Does HTML count?
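The "convert email into an XML format" step Jason describes can be illustrated with a minimal sketch. Mark Mail does this conversion in Java and its actual XML schema is not given in the interview, so the Python below, the function name `email_to_xml`, and the element names (`message`, `headers`, `body`) are all assumptions made purely for illustration.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# Hypothetical sample message; addresses and content are made up.
SAMPLE = (
    "From: alice@example.org\n"
    "To: dev@example.org\n"
    "Subject: Hello\n"
    "\n"
    "Hi there\n"
)


def email_to_xml(raw: str) -> str:
    """Turn a raw RFC 822 message into a small XML document.

    The element names are illustrative only, not Mark Mail's schema.
    """
    msg = message_from_string(raw)
    root = ET.Element("message")

    # Copy a few common headers into child elements.
    headers = ET.SubElement(root, "headers")
    for name in ("From", "To", "Subject", "Date"):
        value = msg.get(name)
        if value is not None:
            ET.SubElement(headers, name.lower()).text = value

    # Collect the plain-text body; attachments are ignored in this sketch.
    if msg.is_multipart():
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        text = "\n".join(p.get_payload() for p in parts)
    else:
        text = msg.get_payload()
    ET.SubElement(root, "body").text = text

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(email_to_xml(SAMPLE))
```

Once messages are in a form like this, an XML-aware query language such as XQuery can address individual headers or the body directly, which is the property Jason says makes it a good fit.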
Mike Hendrickson: Probably not. So, it sounds like you've got a really interesting mix of technologies. How do you find the right developers to work at - to work on Mark Mail?
Jason Hunter: It can be a challenge to find really talented people. We've taken the approach right now of just hiring the best people we can find and, usually, languages aren't that awful to learn if you're a really expert person and we've got some people who have a lot of experience in these languages or have picked 'em up. The TCL is funny, though. The guy who did the TCL, I saw the check-in and I was like, "Was that TCL that I just saw?" He's like, "Yeah." "Did you ever program TCL before?" "No."
"Alright, you just picked it up." So, if you find the right people, you can do that. We've - we're kind of lucky in that we're creating a site, markmail.org, which people who are good programmers kind of use already and so, it's nice when you're doing recruiting that you can say, "Hey, you understand why this site would be interesting to people like you" and so, that makes it a little bit easier. It's better than, you know, if you're doing a hospital payroll system, a technologist may not be as excited by it, but for people who are already into open source technologies, people at OSCON, "renaissance programmers" interested in a lot of things and interested already in what Mark Mail does, it is kind of a perk that you can offer 'em, an attractive thing.
Mike Hendrickson: So, you know, one of the things that I've heard a lot lately is performance, performance, performance and, you know, with all these different languages and all these different developer type people working on Mark Mail, are you at all concerned as you scale up to that billion messages about performance? That - how are you gonna address performance in the future if you get really big and really popular and you have a kazillion people using Mark Mail? Are you at all concerned about performance?
Jason Hunter: We'd better be. . .
Mike Hendrickson: Yeah.
Jason Hunter: . . .if we're gonna do it right.
Mike Hendrickson: Yeah.
Jason Hunter: The good part is, working at Mark Logic, this is - I've been there for five years and I've helped on these sites that make Mark Mail look small. So, it's not my first time building something that's substantial and it's kind of what Mark Logic does. Our customers do big, fast sites. That's what they're interested in doing, so I'm bringing all that experience in so that we've architected it from the beginning, assuming that scale. And, you know, it's that shared-nothing architecture where we can just add new hardware to handle the load. The way Mark Logic works, if I have more data, I can add a data management machine. If there's more users, I can add a particular user management machine; it's an evaluation machine.
So, these clusters can get very large and as I add more data, I know that I can do what we've done with our customers and as we add more users, I can do what we've done with our customers. So, it'll be a long time before we have more traffic than I'm used to dealing with in some of the customer deployments.
It's just - this is kind of fun for it to be one of the ones where Mark Logic is doing it instead of doing it with our customer stuff. So, for me, it's exciting 'cause it's the first time I've owned the thing. I. . .
Mike Hendrickson: OSCON 2008, what was your favorite thing about this year's OSCON?
Jason Hunter: The best thing this year was meeting the people that I've been emailing with, all the project leaders on the various open source projects. As we load their communities, we go on outreach to understand what Mark Mail does for them, what they could do with it. We - instead of just doing scattershot, load as many mailing lists as we can, we try to kind of do all the Perl lists and really work with perl.org to understand that. We load all the mailing lists and work with them to help them make use of it; and so, being here, I've been able to meet a lot of the people that I've been emailing with and putting names and faces together, which is kind of neat. So, it's - at every conference, it's largely about the people. The talks are great, but then you see the people, and that experience has been really some of the best.
Mike Hendrickson: Excellent. So, one last question for you - if you had one book that was very influential and you could say repaid itself ten times over, what one book would you say?
Jason Hunter: Should it be an O'Reilly book?
Mike Hendrickson: It can be whatever book you want it to be. [Laughter]
Jason Hunter: It actually is an O'Reilly book, so it's okay. The Steve Souders High Performance Web Sites book is my favorite book of the year because what he does is explain how to make a fast site; and as much as we talked about how do you scale, more messages, more users, a lot of thinking there is, "How do I get the answer back fast? How do I not overwhelm my servers?" And his book talks about, "How do I make the user experience better by doing things on the page, figuring out how to deal with images, Flash, CSS, JavaScript", all the things that you do in these high end websites, and try to make 'em so, from the user's perspective, the time from click to action is better. And there's a lot of interesting things in there - in the first book; he's working on the second book and he's got a blog and I'm reading it, so I'm kinda anxious. And this is the same guy that did YSlow, so you can run YSlow against the site and we actually do; and you say, "I need to be able to answer the question fast, but I also need it to look fast from the user's perspective", so we want to always get that answer, you know, in a tenth of a second, if at all possible, and his book's telling you how to do that soup to nuts.
Mike Hendrickson: Excellent. So, one last question - and I know I said one last question three times ago, but this is the last question. If I'm a user, how do I get more information about Mark Mail? Just Google it or what's the best way to get started with Mark Mail?
Jason Hunter: Mark Mail's at markmail.org.
Mike Hendrickson: dot org, okay.
Jason Hunter: And if you're - like let's say you're interested in Perl, then you can go to perl.markmail.org; if you're interested in PHP, you can go to php.markmail.org, and that will preselect just those lists.
Mike Hendrickson: Okay.
Jason Hunter: And then you can search away - search for your name. Everybody loves to search for their name. See when you posted, where you posted, that kind of stuff and we have a FAQ and we have a feedback form, so if you like want your own list loaded, feel free to type it into the feedback.
We prioritize definitely based on what we're hearing from people, so if people are asking for it, we're more likely to load it sooner.
Mike Hendrickson: Excellent. Well, thank you, Jason.
Jason Hunter: Sure.
Mike Hendrickson: Thanks.
Jason Hunter: Thank you.