Web 2.0 Expo NY


10 High Order Bits from the Web 2.0 Expo in NY

Thursday, September 25th, 2008

10. Your Web App: Give it a REST

David Heinemeier Hansson’s session about making Ruby on Rails RESTful cast this battle as an epic one between the REST Rebels and the Imperial WS-* Death Star. It’s going to be a tough fight but you know who’s gonna win.

REST (Representational State Transfer) is the elegant architecture and set of conventions first presented in Roy Fielding’s PhD dissertation “Architectural Styles and the Design of Network-based Software Architectures”. It is well aligned with the HTTP protocol and much simpler to implement and use than SOAP, XML-RPC, etc.

Implementing RESTful APIs in web applications is getting really easy with leading frameworks like Rails and Cake supporting REST as a first-class citizen. The Atom format is leading the charge as a RESTful format supported by the big players: Google, Microsoft, Yahoo, Twitter, etc.
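
As a rough illustration of how REST lines up with HTTP, the sketch below maps the common HTTP verbs onto operations on one resource collection. The `ArticleStore` class, its `handle` method, and the `/articles` paths are invented for this example; this is not how Rails or Cake implement their routing.

```python
class ArticleStore:
    """In-memory stand-in for a resource collection (hypothetical)."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def handle(self, method, path, body=None):
        """Dispatch (method, path) to an operation; returns (status, payload)."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "articles" or len(parts) > 2:
            return 404, None

        if len(parts) == 1:
            # Collection resource: /articles
            if method == "GET":
                return 200, list(self._items.values())
            if method == "POST":
                item = dict(body or {}, id=self._next_id)
                self._items[item["id"]] = item
                self._next_id += 1
                return 201, item
            return 405, None

        # Member resource: /articles/<id>
        if not parts[1].isdigit() or int(parts[1]) not in self._items:
            return 404, None
        item_id = int(parts[1])
        if method == "GET":
            return 200, self._items[item_id]
        if method == "PUT":
            self._items[item_id] = dict(body or {}, id=item_id)
            return 200, self._items[item_id]
        if method == "DELETE":
            del self._items[item_id]
            return 204, None
        return 405, None
```

The point is only that the URL names the thing and the verb names the action, so no extra envelope format is needed on top of HTTP.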

9. “It’s Not Information Overload, It’s Filter Failure”

Clay Shirky’s talk states that since the invention of the printing press humans have always faced information overload. We have been surrounded by more information than we can consume in an entire lifetime for centuries. The problem is not information overload, it’s filter failure. We need better filters.

Jay Adelson of Digg believes building better filters is exactly the mission Digg and other players in the collaborative filter space are addressing.

8. Sensor Driven Data: The Web is Getting Orwellian

With Apple putting GPS in iPhones, Google putting GPS in Android, Nikon putting GPS in the Coolpix P6000, and … you get the point. GPS, motion sensors, video recorders, microphones, and other sensors are increasingly distributed and surrounding us.

Tim O’Reilly believes a BIG revolution is happening here. Tim is really bullish on sensor driven data. Where 2.0 has its own O’Reilly Conference. This space is heating up fast.

7. Javascript is Bringing Sexy Back to the Browser

John Resig gave a session on processing.js, his visualization engine running atop the HTML5 Canvas. The canvas has really low level functionality, a la OpenGL for 2-D surfaces, but with the right libraries in place it can lead to some truly impressive results. Flickr’s Paul Hammond gave perhaps the most compelling story of the use of Javascript and Canvas. After building Flickr Stats, using Canvas for graph visualizations, a team member loaded a page on an iPhone. It just flat out worked.

Unfortunately my friends over at Microsoft are slowing down the progress here with no planned support for the HTML5 Canvas in IE8. Google’s excanvas gets around this for IE users by mapping to VML. Unfortunately excanvas currently only works in quirks mode in IE8. Damnit Microsoft, you’ve brought IE8 a long ways towards being a modern, friendly player on the web, why not support Canvas? Come on Oz, come on.

6. The Open Web is Nearing the Tipping Point

DataPortability co-founders Chris Saad and Daniela Barbosa gave a great session on the basic motivations behind the movement. The future the DataPortability group is trying to create, one which allows us to own our data, our contacts, our relationships, etc. and be able to move them freely and easily between the on-line systems we use, sounds truly empowering. The big players are joining the party: Microsoft, Google, Facebook, Six Apart, LinkedIn, Yahoo, Digg, Plaxo, MySpace. But Chris says “Who cares about them? This is a grassroots effort!”

Joseph Smarr, Chief Architect of Plaxo, gave another interesting session on the major components of the open web and how they fit together. OAuth, OpenID, OpenSocial, and others were covered. The feeling I walked away with is that we’re a lot closer than I thought.

5. Web Scalability thanks to Async & Danga

“You can’t drop something in 40,000 buckets, synchronously, at once”, said Digg’s Lead Architect, Joe Stump in his session “Scaling Digg and Other Web Applications”. He was referencing what happens when Kevin Rose posts a message on Twitter. (Rose actually has nearly 65,000 followers on Twitter.) Asynchronous task queuing is how the folks at Digg, Twitter, and Flickr deal with problems that are really hard to do in real time in any scalable fashion.
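
To make the idea concrete, here is a minimal sketch of asynchronous task queuing using only Python’s standard library. The `DeliveryQueue` class and the `deliver` callback are made up for illustration; sites like Digg use dedicated job systems rather than in-process threads, so treat this as a toy model of the pattern and not their implementation.

```python
import queue
import threading


class DeliveryQueue:
    """Toy fan-out queue: the request path only enqueues, workers deliver."""

    def __init__(self, deliver, workers=4):
        self._tasks = queue.Queue()
        self._deliver = deliver  # hypothetical callback: deliver(follower, message)
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def post(self, message, followers):
        """Called while handling the request: returns as soon as tasks are queued."""
        for follower in followers:
            self._tasks.put((follower, message))

    def _work(self):
        # Each worker pulls one (follower, message) task at a time, forever.
        while True:
            follower, message = self._tasks.get()
            try:
                self._deliver(follower, message)
            finally:
                self._tasks.task_done()

    def wait(self):
        """Block until every queued task has been processed (handy in tests)."""
        self._tasks.join()
```

The poster gets a response once the tasks are queued; the 40,000 “buckets” are filled in the background at whatever rate the workers can sustain.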

Just about all of Brad Fitzpatrick’s (of LiveJournal and OpenID fame) lightweight systems software, freely available at Danga.com, seems to be used by the biggest Web 2.0 players to achieve scale. That memcached, gearman, perlbal, djabberd, and mogilefs, all came out of Fitzpatrick and Danga is just incredible. No wonder Google gobbled him up from Six Apart.

4. Web 2.0 Traffic: It’s Out-of-Band

The knowledge tidbit that stuck out more in my mind than any other was that Twitter gets 10 times the amount of traffic from its API than it does through its website. It makes sense, I’d just never acknowledged it explicitly. Dion Hinchcliffe’s workshop painted a similar story for many other Web 2.0 successes. The canonical example is YouTube with the embedded video. The decision to put HTML snippets plainly visible, right beside the video, was perhaps their most genius move. Modern web applications and services are making themselves relevant by opening as many channels of distribution as possible through feeds, widgets, badges, and programmable APIs.

3. Cal Henderson’s PHP Tent Revival

If not for Cal Henderson I might never have touched PHP again. I’m probably going to come back to this topic in more depth in a future post but Cal’s workshop “Scalable Web Architectures: Common Patterns and Approaches” renewed my interest in, relationship with, and respect for PHP. The funny thing is that wasn’t even the point of the talk. Cal and Joe Stump of Digg’s succinct point that Languages Don’t Scale is right on. Sure PHP isn’t as beautiful, trendy, or well designed as Python or Ruby are. However, some of the design decisions made by PHP’s Rasmus, specifically the ‘shared nothing’-ness, make it a great technology for web applications. There’s a reason why Facebook, Digg, Flickr, and co. are still on it.

After Cal’s workshop I asked him: if you could do it all over again with Flickr would you choose to go with Python or Ruby? Cal’s answer: Nope, I’d do it in PHP.

2. Set Your Baby Free

By grooming and nurturing a web app internally for an extended period of time you lose a lot of value. Jason Fried’s notion of “half a product is better than a half-assed product” is so fitting here. Sandy Jen of Meebo echoes similar notions in her talk: Start out with something simple, see if it works, evolve. Bring your customers into the feedback loop as quickly as possible. Joshua Schachter, founder of delicious, spoke of the exact same sentiments in his talk on “Scaling and Building Social Systems”.

1. Want to Set the World on Fire? YOU Better Bring the Fire.

If you are not bringing the heat, get out of the kitchen. Passion was the common thread amongst the most inspiring talks I saw at the conference. Between Gary Vaynerchuk, Jason Fried, and Arianna Huffington the message was consistent: be passionate. I’m going to let Gary roll this one out with his amazingly energetic keynote on building personal brand…

Joseph Smarr – Tying it All Together: Implementing the Open Web

Friday, September 19th, 2008

[Live from Web 2.0 Expo 9/16 - 9/19 Follow along the other Expo Talks in RSS.]

Joseph Smarr is the Chief Platform Architect at Plaxo.

Lots of open source building blocks for bringing things together. How do all these pieces sit together and what is the landscape going to look like when the dust settles?

The social web today is very broken. On each site you have to re-create an account, re-enter profile info, re-find friends, re-establish relationships. New social apps have limited options: create yet-another-silo and start from scratch or make a widget inside of an existing walled garden.

There’s got to be a better way and there is. Help is on the way. It’s coming in the form of new building blocks that establish: who I am, who I know, and what’s going on. We’re going to aim for the medium level of detail on each of the projects which fit into these building blocks.

Who I am: Creating a portable, durable online identity. OpenID is important in this space. OpenID lets you come to a new website and log in with an account that exists on another site. You can sign up and sign in with your existing account. You can then link and share your profile data between sites. When you go through Plaxo’s sign up you can sign in with any OpenID. This takes you over to your identity provider and allows you to verify that you want to share your information with Plaxo. This is good for users and for Plaxo by reducing friction. Yahoo is an OpenID provider, MySpace is on the way, AOL is signed up, and some of Google’s properties are supported; this has majorly caught on.

Consolidate your online identity with me-links for rel=me (XFN). The social graph API allows you to query Google using REST for the downstream me links. This makes it easy to find out more information about users by what exists on the web. Again great for both the consumer in not having to duplicate info and great for businesses in terms of getting data into your systems.

Who I know: You need to be able to build and maintain relationships. Until recently the only way you could get at this information was to scrape your webmail address books. It’s kind of hacky and insecure. The good news is that over the last year the providers have accepted that this isn’t going away and that it’s useful, and they’ve made it easy to practice safe portability. Google, Yahoo, and Microsoft have mechanisms for getting at the information without giving a new service your webmail password. OAuth is a means for sharing private data between trusted sites. A bunch of people came together and came up with a standard way of getting at data. OAuth is supported by Google and MySpace, and it’s a part of DataPortability. OAuth gives a third party site a token which is revocable, and access can be scoped. Friends-list portability allows for continuous discovery across multiple sites.

The Open Stack

What’s going on? Because the entire web is becoming social you’re creating and doing interesting things on a lot of different sites. You can’t walk each site to check and see who is doing what. OpenSocial is trying to define a standard language for social networking applications on the web. You can drop in widgets that work on all social networking sites. OpenSocial is going mainstream and will have over 500 million users by the end of the year. Everyone is agreeing on standard APIs at the server-to-server level. RSS and Atom are another important piece which is often overlooked. They’re an important standard for sharing “here’s what’s going on right now.” If you put RSS together with OAuth you can get private update feeds. Jabber/XMPP is becoming more important, too. It started as an open standard for instant messaging, and one of the things they built in as a result is that it is federated. It’s a good set of open tools for different sites sending messages to each other.

This stuff is out there, it’s real, and it fits into these standard blocks. What we’ll do now is pull everything together.

How does the friends list portability work?

  1. Tell the site your social graph provider: XRDS-Simple (discovery) + OAuth (access)
  2. Site fetches your data to find local friends: There is no standard way to do this yet. An up-and-coming project, still in draft spec, is PortableContacts.net.
  3. Site lets you connect to people you want: You can periodically look for new matches.
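
Steps 2 and 3 above can be sketched as a simple matching pass. Since the notes say there is no standard contact format yet, the shape of `remote_contacts` (dicts with an `email` key) and both function names are assumptions made for this example only.

```python
def normalize(email):
    """Compare addresses case-insensitively and ignore stray whitespace."""
    return email.strip().lower()


def find_local_friends(remote_contacts, local_users):
    """Return ids of local users whose email appears in the fetched contacts.

    remote_contacts: iterable of dicts with an optional 'email' key (assumed shape)
    local_users: dict mapping email address -> local user id
    """
    index = {normalize(email): uid for email, uid in local_users.items()}
    matches = []
    for contact in remote_contacts:
        email = contact.get("email")
        if email and normalize(email) in index:
            matches.append(index[normalize(email)])
    return matches


def new_matches(matches, already_connected):
    """Step 3: on a periodic re-check, keep only people not yet connected."""
    return [uid for uid in matches if uid not in already_connected]
```

Running `find_local_friends` on a schedule and passing the result through `new_matches` is one way to get the “continuous discovery” described earlier without re-prompting for people the user already knows.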

How does contact portability work?

  1. User signs in with an OpenID: Site fetches OpenID URL -> looks for X-XRDS-Location, Site parses XRDS-Simple doc to discover available APIs
  2. Site tries to access contacts API -> gets a 401: WWW-Authenticate response header specifies OAuth, OAuth discovery (via XRDS) provides OAuth endpoints
  3. Site sends user through OAuth
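
The discovery half of this flow can be sketched roughly as follows. The sample document, the `provider.example` URLs, and the helper names are all illustrative assumptions; a real site would fetch the XRDS-Simple document from the URL named in the X-XRDS-Location header and would follow the OAuth spec for the actual token exchange.

```python
import xml.etree.ElementTree as ET

# Namespace used by XRD elements inside an XRDS document.
XRD_NS = "{xri://$xrd*($v*2.0)}"

# Hypothetical discovery document a provider might serve.
SAMPLE_XRDS = """<xrds:XRDS xmlns:xrds="xri://$xrds" xmlns="xri://$xrd*($v*2.0)">
  <XRD>
    <Service>
      <Type>http://portablecontacts.net/spec/1.0</Type>
      <URI>https://provider.example/contacts</URI>
    </Service>
    <Service>
      <Type>http://oauth.net/core/1.0/endpoint/request</Type>
      <URI>https://provider.example/oauth/request_token</URI>
    </Service>
  </XRD>
</xrds:XRDS>"""


def discover_services(xrds_text):
    """Step 1: map each advertised service type to its first endpoint URI."""
    services = {}
    for service in ET.fromstring(xrds_text).iter(XRD_NS + "Service"):
        types = [c.text.strip() for c in service if c.tag == XRD_NS + "Type" and c.text]
        uris = [c.text.strip() for c in service if c.tag == XRD_NS + "URI" and c.text]
        if uris:
            for service_type in types:
                services[service_type] = uris[0]
    return services


def wants_oauth(status, headers):
    """Step 2: a 401 whose WWW-Authenticate header names OAuth means 'go get a token'."""
    challenge = headers.get("WWW-Authenticate", "")
    return status == 401 and challenge.strip().lower().startswith("oauth")
```

If `wants_oauth` returns true for the contacts endpoint, the site would look up the OAuth endpoints from the same discovery result and send the user through the OAuth approval step.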

A resource Joseph wrote on OpenID: http://www.plaxo.com/api/openid_recipe

Jeffrey Zeldman and Panel – Content Matters

Friday, September 19th, 2008

The Panel:
Liz Danzico – Bobulate.com – part information architect, part usability analyst, and part editor.
Alex Wright – “Information Architect” – NYTimes.com – Previously a UX designer and journalist in California.
Bre Pettis – “Videographer” – BrePettis.com, Etsy.com
Kristina Halverson – “Content Strategist” – BrainTraffic.com
Jeffrey Zeldman – “Evangelist” – AlistApart.com, HappyCog.com, Zeldman.com
Paul Ford – “Editor” – Harper’s Magazine, ftrain.com, themorningnews.com

Liz Danzico Setting Context for Panel Discussion

The people on this panel are interested in changing your minds on the role of the content in design and user experience mocks. The original name of this talk was: copy matters.

You may have heard these two conflicting views: 1. Content drives traffic. Content certainly is a primary reason for users to come to you. 2. Users don’t read on-line (Jakob Nielsen).

We will be talking about whether we are at a cross-roads. Going back a little bit in history to transitions in media, people referred to television as radio with pictures. When movies and television were first made cameras were stationary. Early MTV was just bands on the stage playing. When we make transitions from old to new we borrow metaphors from old technology and apply them to the new. As we look at the web as a publishing medium we’re looking at a different publishing medium. Different responsibilities for makers, editors, etc.

“We aren’t writing, we are speaking in text.” – Erika Hall of Mule Design.

“The internet looks like writing, but it’s actually a conversation.” – Khoi Vinh, NYTimes.com

What kind of content are we talking about in the panel?

Navigation & Orientation content – Daytum.com, Flickr.com with rotating greetings setting the tone in different languages. Navigation on CNN is an example of clear communication on sub-pages. Matched with URL and page hierarchy.

Labels & Action – Vimeo with its labels. Geni.com as a quick review of what labels are.

Help Content – Tick “Just kidding I Remember Now” link next to “Send me my password”.

Non-textual content – Visual content like election maps, photos, info visualization.

Content, content, content! – Marketing communications sites like Business Week with Editorial Content, etc.

With this new publishing model and these new types of content, how are we going to make it work? Ask the experts!

Q) What is the nature of the content work you do? The reason I asked each of you to the panel is because each of you has a different role in content.

Halverson – Typically our clients are dealing with content that helps discuss their products and services. We are trying to help them wrangle it, plan for its creation, create it, and put standards and structures in place to help them govern that content. I got into content strategy because I was handed wireframes to fill with content for websites.

Pettis – I got into video and videos about how to make things. For two years I had a show called video projects. I’m a video guy who thinks print is a part of the past.

Zeldman – With the magazine I write, with my website I write. I was a journalist and worked in advertising and copywriting for a long time. When I started websites back in 1995 the whole thing was that it was self-publishing. I thought everyone was going to learn HTML and be self-publishing. I thought everyone would find their voice and try to find an audience. With our client services projects we always start with the content, what it is, what’s there, how they’re going to interact with it. We develop content strategy and architecture before we get to design. If you bring design in fairly late in the process you’ve already worn them down. In terms of the magazine it’s a labor of love.

Q: How do you deal with accommodating both organizing content that is purely visual and content that is textual?

Wright – NYTimes publishes a lot of content and we increasingly publish a lot of multimedia content: images, slideshows, interactive Flash, etc. The issues that come up are around the metadata layer. All of our photos are in a huge database that exists separately from the article database and content management system. We have a good taxonomy to tag articles but we don’t have the same capabilities with images. From a design point of view we’re constantly trying to figure out how to weave that content into the site.

Pettis – The thing I’m excited about on the internet is that people find a passion and get into it and publish about it. I’m on the blogging team at Etsy. We have really passionate users.  What we do with the blog is open it up so that anyone can pitch ideas. We have over 300 authors on the blog. We have a video team of people who want to point cameras at things and record what they’re thinking and doing. It’s a way of sharing passion and excitement.

Q: How are you helping your clients become sophisticated publishers of content?

Halverson – That for me is way down the tracks. As an example take a company with 12 different business units serving 122 different markets, producing content for a lot of audiences. Our process is A) figuring out where the content is and who is out there, B) establishing who is publishing content, reviewing content, etc., and C) governing consistent brand standards across the content. There’s a complete infrastructure lacking within many organizations between print marketing and interactive marketing. We start by trying to bring these people under the same umbrella. That challenge in some organizations is really difficult. They spend all their money on brand and have no idea how to govern and create content.

Zeldman – I would just like to say it’s mostly luck. Like Woody Allen said about love: sometimes you’re lucky, sometimes you’re not. With clients we would turn things over and sometimes the client would use it and take off running, sometimes they wouldn’t. We built a content management system and would write guidelines and sometimes clients would follow them and sometimes they wouldn’t. We’re doing something for a food manufacturer. They make a delicious bar which has a cool brand and has medical implications, so the challenge for copy is that there are pages that have to address people with lupus and there are other pages where there are birds with funny sayings. What rules do you give the client for when they use each tone, and how to transition? We create matrices and recommendations and if we’re lucky the client has the right people and the right talent to keep it going. You hope that everyone is passionate about the project.

Question: How do you approach content from a user generated view point?

Zeldman – I think it’s both. You have to talk about both sides of the equation.

Liz – More and more it’s our responsibility as designers to think about creating very good frameworks that are well thought through, intuitive, and provide intuitive roles which people can participate in. That’s one step: designers create a framework to participate. The second step is for users to actually be involved. The third is an editorial responsibility of the client to monitor content.

Zeldman – If you abandon the responsibility of editorial control you lose a lot of value. If your content is of high quality you’ll get comments of high quality. Generally because the writing is so good at NYTimes there are some really well thought out comments.

Wright – When user generated content works well it’s when it’s well channeled. If there’s a cacophony of noise you can’t get anything out of it. At NYTimes you can comment on certain articles but it’s all moderated by a team of people that try to keep the level of discourse civil. The notion of just opening things up and letting people go after it leads to craziness.

Ford – Another example: how many wiki sites are dead on the web right now?

Zeldman – The first site I worked on for a client was Batman Forever in 1995. We had a forum and seeded some content. It went well initially, people used it. But once we were off the job and the movie stopped needing to sell tickets, they stopped keeping track of the forum. Suck.com later did a piece on it. People were making racist comments and trying to have sex with each other.

Ford - You can get content for free but you can’t get editing for free.

Halverson – We’ve had companies who try to fix their content problems by buying a really expensive CMS. They think the magical content will just arrive. You must plan for, create, and govern content.

Zeldman – Even Flickr is about constraints. They encourage a certain kind of user.

Halverson – Good example but an easy example because the site is for fun. It gets more complicated when the matter is more serious.

Q: What kinds of tips can you give to people who are responsible for creating content?

Wright – Part of the design process is asking: what are the words that exist on the page? “The vacuous victory of typesetters over authors.” People tend to think of the web as boxes and content blocks.

Ford – If you are the one doing content make it very easy to get tons of feedback. I’m the sole guy doing web copy on Harper’s and I hear back from customers constantly. When you are the person doing the copy it’s your job to make it as straightforward as possible.

Pettis – I’m shocked to hear businesses farming out content and passion.

Halverson – I often work with people who are responsible for content on top of their many other jobs. Don’t conceive of and put boxes on your wireframes for content you don’t have time to create and govern. Scale back.

Zeldman – I think that’s the most important piece of advice I’ve heard today. Scale back. If you don’t have the people to do it, you shouldn’t do it. Grow slowly.

Jay Adelson – Organizing Chaos: The Growth of Collaborative Filters

Friday, September 19th, 2008

Jay Adelson is CEO of Digg, guiding all aspects of the company’s development, growth and management. Under his leadership, Digg has grown to 26 million visitors per month, and is now considered one of the top socially focused Web sites.

Why do collaborative filters matter? How many of you have used Google? How many of you have used Digg? Any time you take the interests of a group and use that to filter and create relevance for an audience and a group then that is collaborative filtering. Even search is, in a sense, collaborative filtering: just think about BackRub and PageRank or clicks on a search result. This has evolved.

So what’s changed? Now you’re on the web 24 hours a day. In 2003 Berkeley said there were about 2.3 million sites added every day. Now there’s about a terabyte a day added to the net. This data is dynamic. Privacy and the sense of privacy have also changed. The younger generation doesn’t have the same issues associated with privacy that we have and our parents have. I use my away message on AIM to say “I’m at lunch”, whereas my teenage babysitter’s will say “I’m feeling down” or “I’m full”. We are moving from a seek culture to a connecting culture.

Let’s break social filtering down into three parts:

1) something like a Digg or a Zeitgeist is the same for everyone.

2) social networks where I create a subset of groups with just my friends. I can’t use my friends as a judging factor for what might be interesting to me.

3) The exciting thing, the point I can leave you with today, is the hyper-personalization opportunity. Instead of looking at a social network, look at everyone, pair you with people like you, and use that collective wisdom to surface things that are more specifically interesting to you. Since your personal data is going to move from one website to the next you have to think about how you can take that information and deliver experiences specific to individual users. Collaborative filters are the key to the monetization of Web 2.0 applications.
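
The third kind of filter, pairing you with people like you, can be sketched in a few lines. This is a generic user-to-user approach using Jaccard similarity, chosen here for simplicity; the talk did not describe Digg’s actual algorithm, and the `likes` data shape and function names are assumptions for this example.

```python
def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes, top_n=3):
    """Suggest items the user hasn't seen, weighted by how similar their fans are.

    likes: dict mapping user name -> set of item ids that user liked (assumed shape)
    """
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0:
            continue  # people with nothing in common contribute nothing
        for item in theirs - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken by item id so the order is stable.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

Note that nobody in `likes` has to be a declared friend of `user`; the pairing comes entirely from overlapping behavior, which is the difference from the social-network filter in part 2.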

Arianna Huffington in Conversation with Tim O’Reilly

Friday, September 19th, 2008

Arianna Huffington is an author and nationally syndicated columnist in the United States. She is the founder of The Huffington Post, an online news/commentary website and aggregated blog.

Tim: How did you get where you are? People look and think you’re at the center of the new media world.

Arianna: The secret is you must have passion. I fell in love with the web when I realized that people without a platform could have one on the web. I love the obsessive/compulsive nature of the web. Even if you don’t own a printing press or the history of a hundred years in business you can reach an audience. We wanted to do three things: we wanted to have an attitude, we wanted to provide a platform for other bloggers (we now have over 2,000), and we wanted to have a community which is central to what we do.
From day 1 we made sure the comments on the blog were pre-moderated. There is no substitute for pre-moderation. We now have 30 humans who pre-moderate around the clock all year.

Tim: Do the moderators actually participate in discussions?

Arianna: For now all they do is delete ad-hominem, violent attacks.

Tim: Describe your process. I remember a movie about Watergate where there was a huge debate as to whether the news should be front page or not. Do you guys have staff meetings where you debate what news goes front page?

Arianna: We’ve got an editorial team. We’ve got a guy who decides the splash headline. We still believe Iraq is the biggest disaster in the history of US Policy. Even when mainstream isn’t following it anymore we are because it is such a catastrophe.

Tim: So you’re kind of like Rupert Murdoch, you don’t necessarily want to be impartial but can promote causes.

Arianna: There is a huge difference between us because the news we run is based on facts. When you say you’re giving both sides of the story and one side of the story isn’t based on fact it’s not newsworthy to present it.

Tim: How do you read the current political situation?

Arianna: You know, it’s really interesting. Last week we had this amazing phenomenon of the media being distracted. The selection of Sarah Palin was a little bit like a soap opera. People obsessed over all these small stories around her. Did she really sell her plane on eBay? Then, suddenly, reality set in and the house is on fire. We’ve woken up and I’m glad about it. I wrote about it and felt that Sarah Palin was a Trojan Horse. I really feel she is a major danger.

Tim: Let’s talk about financial deregulation. This is a house that’s currently burning. What do you think about what brought us here?

Arianna: What brought us here is the illusion that free, unregulated markets bring about public good. Look at the 85 billion we’ve agreed to put into AIG. America is basically telling the people: if you are big enough you are not going to be allowed to fail. But if you’re an ordinary American and your house is foreclosed then you’re on your own. This isn’t what America used to be.

Tim: Coming back around to Huffington Post as a new media phenomenon. You’re in New York, not California, what has that brought to you?

Arianna: We love being in New York. It’s been great for us to be infused in the energy of the city. It’s been a great place for us to recruit young, driven editors. We’ve found great people to work in our technology department. The key is to be surrounded by people who constantly want to invent and reinvent. We are constantly creating new technological tools, bringing in new video, all of it is a part of keeping people engaged.

Tim: Why do you think the conservative blogs have failed and progressive blogs are succeeding?

Arianna: Conservatives do so well on talk radio because they’re great at being blowhards. It’s because they can talk without being corrected or checked on facts. They don’t have to speak in truth because no one is pushing back. Progressives do well on the internet because on the internet you have the masses checking your facts and ensuring you’re on top of the truth.

Audience Member: I’ve noticed in America we’re engaging the Islamic world. On the other hand our media organizations lack people who are Muslim American writers.

Arianna: First of all that is something we’d like to get better at. If anyone wants to write on any subject of any creed just shoot me an e-mail. It doesn’t have to be politics. Half of our traffic isn’t politics. It could be on any subject.

Tim: Do you think the internet will bring greater transparency?

Arianna: As you know, radio was a great tool for fascism. I believe the internet is a great tool for democracy. We see the way Obama is using the internet. Without a doubt without the internet Barack Obama would not be the nominee. It’s increasingly become clear that political leaders need the community to be pushing them to do the right thing. People fascinate themselves with polls but they’re awful. At Huffington Post we’ve got this new feature called Pollstrology.

Tim: Could you speak to any of the trends you’re noticing as you bring groups together at the website?

Arianna: There is something I’m noticing which I believe is going to be the next big thing of the internet: there is a huge need and longing to unplug and recharge. Our subhead is unplug and recharge. If all you care about is success and power, sex and money you are still incredibly better off if you unplug and recharge. Find an oasis.  We have 2-5 minute breaks.

Tim: What do you do to recharge?

Arianna: I personally do yoga, I hike, I get enough sleep most of the time. If you have any ideas for how to recharge let us know, send them to me.

[ Follow the Feed for notes on talks from other web leaders & innovators at the Web 2.0 Expo in New York going on this week. ]

Sandy Jen – Scaling Synchronous Web Apps: Lessons Learned from Meebo

Thursday, September 18th, 2008

Sandy is a co-founder of meebo in Mountain View. She majored in Computer Science at Stanford. Sandy is the ‘Server Chick’.

Things to keep in mind about scalability: what works for someone else won’t necessarily work for you. You know the most about your stuff. Don’t hire consultants. You built your app. Don’t get married to a technology but don’t be a total flirt. It is a very high cost to rip out your guts and start over again. Remember that this is supposed to be fun. You’re building a product. You always have a customer who will be happy that you’re building this thing for them.

Synchronous web applications are very different from asynchronous web applications. Traditionally “async” implies “more complex”. On the web it’s the opposite because the browser is meant to be async. If you’re going to build a synchronous app you’re probably taking something that used to be on the desktop to the web. That’s where all the problems of scaling come in. Meebo is a good example of synchronous on the web, and so are Gmail and many games out there.

Doing synchronous on the web is like trying to fill a square hole with a round peg. The hole is that there are a lot of platforms (OS, browsers, etc). When we test a release we test on all the OSes and the browsers and Safari and now Chrome. Network connections are not going to be 100% stable, and there are still people using dial-up. The limitation of only being able to have 2 open HTTP requests imposes a serious constraint. It’s hard to measure how successful a synchronous app is based on traditional page view metrics. Alexa doesn’t pick up how much traffic Meebo actually gets in relation to other page view based sites.

The peg, the thing you’re shoving into the hole, is the need for instantaneous data transfer. Long polling is challenging because you’re using resources on both the client and the server. This is making the browser do more work. The user experience needs to be seamless and feel fast, light, and feel better than the desktop equivalent.
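
For readers who haven’t seen long polling, here is a minimal server-side sketch of the idea: the client’s request is held open until there is something new or a timeout passes. The `Mailbox` class and its cursor scheme are invented for this illustration, and it uses threads for brevity; Meebo’s actual server is a custom web server module written in C.

```python
import threading


class Mailbox:
    """Holds messages for one client and lets a request wait for new ones."""

    def __init__(self):
        self._messages = []
        self._cond = threading.Condition()

    def publish(self, message):
        """Called when something happens, e.g. an IM arrives for this client."""
        with self._cond:
            self._messages.append(message)
            self._cond.notify_all()

    def long_poll(self, cursor, timeout=30.0):
        """Block until messages exist past `cursor` or `timeout` seconds pass.

        Returns (new_messages, new_cursor); the client sends new_cursor
        with its next request so nothing is delivered twice.
        """
        with self._cond:
            self._cond.wait_for(lambda: len(self._messages) > cursor, timeout=timeout)
            new = self._messages[cursor:]
            return new, cursor + len(new)
```

Each waiting client ties up a connection (and here, a thread) for the whole timeout, which is the resource cost on the server that the paragraph above is pointing at.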

What is synchronous? What has to be synchronous? What doesn’t have to be synchronous? The more you try to dump into synchronous the more trouble you’re going to have trying to scale it. Sometimes you cheat in order to create the seamless user experience.

Find the right holes for your pegs: don’t underestimate server side architecture! Type of app determines the type of synchronous scaling. Bottlenecks can be anywhere: memory, CPU, bandwidth, storage, disk i/o. Based on the type of app you’re building it will be in different places. You won’t know where all the bottlenecks are until you let it loose. With Meebo it went from one to another to another. We solved the memory problem a while ago and it’s come back since then.

Things that people say are great, synchronous helpers: long polling (COMET) connections without having to poll every 5 seconds. Meebo started with Apache and it wasn’t good for us, so lighttpd was a much better fit. When it comes to compiled vs. interpreted languages it goes either way. We use C but it’s kind of a bitch to hire for because there aren’t many people doing it any more. Databases can be really expensive or really cheap. Start simple and if you need to get more complicated do it when you need to. Memcache is great, we’ll talk about it more later. Load balancers, finally, are just really expensive and you have to buy in pairs.
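To make the long-polling idea concrete, here is a minimal in-process sketch in Python. It is not Meebo's implementation (the talk says theirs is C inside the web server); `Mailbox` and `wait_for_message` are hypothetical names, and a real server would hold an open HTTP request rather than a blocked thread. The point is only the shape: the request waits until a message shows up or a timeout passes, instead of the client asking again every 5 seconds.

```python
import queue
import threading


class Mailbox:
    """Holds pending messages for one connected client (hypothetical helper)."""

    def __init__(self):
        self._messages = queue.Queue()

    def push(self, message):
        # Called when something happens that the client should hear about.
        self._messages.put(message)

    def wait_for_message(self, timeout):
        # Long poll: block until a message arrives or the timeout expires.
        # An empty list tells the client to simply reconnect and wait again.
        try:
            return [self._messages.get(timeout=timeout)]
        except queue.Empty:
            return []


mailbox = Mailbox()
# Simulate a buddy sending an IM shortly after the client starts waiting.
threading.Timer(0.05, mailbox.push, args=["hi from a buddy"]).start()
first = mailbox.wait_for_message(timeout=2.0)    # returns once the IM arrives
second = mailbox.wait_for_message(timeout=0.05)  # nothing pending: times out
```

The cost Sandy mentions is visible even here: every waiting client ties up something on the server for the length of the wait.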

Simple is better unless you’re rich. First question: what am I using it for? Am I using memcache because everyone else is? We tried it at Meebo but it turns out most of our data wasn’t cacheable. What am I gaining? Scalability at the cost of maintainability? Can I use DNS round-robin instead of load balancers? FastCGI vs. web modules vs. PHP? When we first started we didn’t want to reinvent the wheel. We started out really simple with CGI written in PHP. We wound up just writing a module directly into the web server and that’s what we’re still doing today. Start out with something simple, see if it works, evolve. Do I need to save state? Is it persistent? Can I store it in a cookie? Meebo didn’t have user accounts for a year. Launching feedback light is not a bad thing.

There’s a constant tug of war between the front-end and the back-end. Whose bug is it anyway? You have to figure out where the workload makes sense. The browser can be really slow. Most of Meebo’s users use IE. Say you make a web request and pass a lot of data down to the client to process: it can really bog down the user experience. Pick one, release it, and ask if it’s slower or faster than the last release. Your users know more about your product than you do. Listen. Efficiency with data transfer: when we first started out I picked variable names with single letters to save bandwidth. Once we started hiring it was confusing.

Must find a balance between good enough vs. perfect. Perfection is enough simplicity in the system to allow for adaptation. Users don’t care how clever you are, they just want their product to work. Long polling isn’t perfect, browsers have quirks. Sometimes perfect is not good enough (look at Ruby!). Release enough and things will asymptotically approach perfection. Don’t be afraid to try things.

Think ahead but don’t think ahead too much. A great example of this is security. You can spend a long time trying to fix security holes but if your product never ships, who cares? Over designed code is hard to roll back from. Hacky code can work and not be so bad. When you first build you won’t know where you’re going to need to scale so over-thinking the problem is a waste. It’s all about balance.

Nothing simulates real life. Have contingency plans on both front-end and back-end. Don’t build flood gates, build dams: One time we rolled out a feature that took a huge amount of bandwidth and we were able to switch it off. When you roll out features be very transparent with your users and say “hey try this out, let us know what you think”, they’ll get a lot less upset when you have to roll it back.
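The "switch it off" contingency can be sketched as a simple feature switch. This is an illustration under assumptions, not Meebo's mechanism: `FeatureSwitches` and the feature name `file_transfer` are made up for the example.

```python
class FeatureSwitches:
    """Tiny registry of on/off switches for risky features (hypothetical)."""

    def __init__(self, defaults):
        self._enabled = dict(defaults)

    def is_enabled(self, name):
        # Unknown features are treated as off.
        return self._enabled.get(name, False)

    def turn_off(self, name):
        # The contingency plan: disable the feature without a new release.
        self._enabled[name] = False


def handle_request(switches):
    # Each request checks the switch, so flipping it takes effect immediately.
    if switches.is_enabled("file_transfer"):
        return "new feature"
    return "fallback"


switches = FeatureSwitches({"file_transfer": True})
before = handle_request(switches)
switches.turn_off("file_transfer")  # the feature is eating bandwidth: dam it up
after = handle_request(switches)
```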

Be a user of your own product. Don’t be afraid to break your own product. Stay in the loop of your community and stay in touch with the pulse. What is your Firefox/IE breakdown? 70% of Meebo’s users use IE. What do we use when we use Meebo? IE.

It’s ok to be “Big Brother” in the sense of being aware of what’s going on. Monitor key areas but don’t go overboard on monitoring, or you’ll learn to ignore your alerts. Ignoring what your systems are telling you through feedback mechanisms is dangerous. Monitoring is being aware of how healthy your system is at any given point. Can I log in? What is our downtime percentage?

Final thoughts are that there are no magic solutions to scalability. It’s important for you to know your system like the back of your hand. Correlate effects to the changes you’ve made in your systems. Do not lose sight of your goal: why are you scaling? Finally, remember, everyone scales differently!

[ Follow the Feed for notes on talks from other web leaders & innovators at the Web 2.0 Expo in New York going on this week. ]

Chris Saad, Daniela Barbosa – Understanding the Basics of Personal Data and DataPortability

Thursday, September 18th, 2008

[Live from Web 2.0 Expo 9/16 - 9/19 Follow along the other Expo Talks in RSS.]

Chris is Co-Founder and Chairperson at DataPortability Project. Daniela is Chairperson, Steering Committee and Co-founder at the Dataportability Project.

There have been many tech inflection points. Intel gave birth to the standardized PC architecture. Windows became a standard for GUIs, drivers, and all the plumbing which allowed for a whole new class of applications to be built. TCP/IP led to the standardization of the internet. HTTP/HTML brought us the web. We’re moving further and further up the stack and we’re getting to the point where we should be thinking about standardizing on the data.

The data portability video was done very early on. You’ve got data everywhere, accounts everywhere, profiles everywhere, friends everywhere, contact details. Upload your photos, avatars, music, rinse, and repeat. Again and again and again. Network fatigue. Your data locked up in someone else’s hands.  DataPortability is all about creating a web where information can freely flow through the network.

Imagine owning and controlling your relationships. Imagine controlling your calendar, images, and other content. Today we all join Flickr, Digg, Twitter, Facebook, etc. Imagine instead that these applications joined you and that you are the data. These applications had to ask for access to the data. Why? So you can sync your friends between Twitter and Del.icio.us and keep them sync’ed. Go to Kodak.com and print your photos from Facebook.

Today we share, we comment, we rate, we create. Users sign up, fill out a profile, add friends, interact with your stuff, you make money, they share your stuff, you get more traffic. Everything but interacting with your stuff and making money is friction that is repeated at every web site. The data lock-in model is actually a myth. Example: Amazon.com has a lot of data on you on their site. Why in the world would they want to release this data? Because Amazon only has a small slice of the data. They don’t know the searches you do on Google. The data they do have is expiring rapidly. In a DataPortability system you get more data, reduced network fatigue, and more usage of everything.

DataPortability means you can access and synchronize the data you have between multiple services.

In January Robert Scoble, with Plaxo, scraped his contacts from Facebook. This is what we refer to as Scoblegate. It set off a flurry of discussion around the idea of being able to move your data between websites and applications.

DataPortability, as a group, is not writing or providing any code; we’re advocating a stack. There are lots of open questions around security, privacy, ownership, business models, and user education. How can we evangelize and educate for the user?

Many of the big vendors are playing: Microsoft, Google, Facebook, Six Apart, LinkedIn, Yahoo, Digg, Plaxo, MySpace. But who cares about them! “If you’re out there I don’t mean you specifically.” (Haha) The DataPortability Project is an open, grass-roots effort.

We have a governance model where everyone can join and get involved immediately. We are experimenting with a radically transparent leadership model. Everyone will own this because everyone has helped to build it. It’s not a product, nor a service, it’s an idea. That means YOU, yes YOU.

Joe Stump – Scaling Digg and Other Web Applications

Thursday, September 18th, 2008

Joe Stump is currently the Lead Architect for Digg where he spends his time partitioning data, creating internal services, and ensuring the code frameworks are in working order.

Digg by the numbers: 30,000,000 Ron Paul fans. 13,000 requests a second, bunches of servers.

“Web 2.0 sucks (for scaling).” Web 1.0 was easy where we had this landrush of just getting content on-line.

With Web 2.0 somebody had a bright idea that we would turn content over to the users. The problem is people like creating a lot of shit. Web 1.0 was easy to scale because I only needed to worry about a couple hundred thousand records. Now we’ve got a lot more to worry about. Another thing I hate is AJAX which makes interacting with websites really easy. It gives users the ability to create shit even faster.

Making your PHP code 300% faster doesn’t matter, it’s not where your bottlenecks are. “PHP Doesn’t Scale” – Cal Henderson. PHP doesn’t scale, Java doesn’t scale, Ruby doesn’t scale – languages don’t scale. When you’re worrying about scale and storing 4 billion kitten photos: how you program it probably doesn’t matter.

What’s scaling? Scaling is specialization. As you get bigger and as you grow the solutions being sold to you by vendors won’t cut it. You have to cut your database into different pieces and make it very specialized and specific to your needs. We’re going to talk about some of the techniques we use at Digg. Scaling is also about severe hair loss. I’m not joking. I’m going bald. It’s tough. It’s not easy. You can’t do it alone.

Often people confuse scaling out with scaling up. You get to a point where you can’t scale up anymore. You can’t just buy more expensive machines at some point. Everyone is scaling out right now with lots of crappy boxes. We expect to fail.

Your mom lied; don’t share. Decentralize, expect failures and just add boxes. Amazon is one of the best at this.

CAP Theorem says you can only pick two of the following three: strong Consistency, high Availability, Partition tolerance.

What are my options? Denormalize, eventually consistent, parallel, asynchronous, specialize.

Denormalization is necessary in partitioned solutions and it’s becoming a huge problem for Digg. If you’re not using queues and messaging systems you’re going to want to look into gearman and djabberd. You wonder why things are going slow and you realize you’re doing 5 synchronous trips to the database. You’ve got to make these calls async with either http calls or gearman. One thing Digg is big on is running the numbers before you try and fix a problem. Run the numbers to make sure things actually will work. We’ll discuss a case of this.
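The move from synchronous database trips to queued work can be sketched as follows. This is a minimal sketch, assuming an in-process `queue.Queue` and a thread as stand-ins for a real job server such as gearman; `handle_digg` and `worker` are hypothetical names and do not reflect Digg's code.

```python
import queue
import threading

jobs = queue.Queue()
processed = []


def worker():
    # Background worker: does the slow writes so web requests never wait.
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            jobs.task_done()
            break
        processed.append(("recorded", job))
        jobs.task_done()


def handle_digg(user, story):
    # The request only enqueues the slow work and returns right away.
    jobs.put((user, story))
    return "ok"


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_digg("alice", 42)
jobs.put(None)
jobs.join()  # demo only: wait until the worker has caught up
```

The trade-off is the one named in the options list above: the write is only eventually visible, since the response goes out before the work is done.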

Tools: Memcached; MogileFS (“OMG Files!”), which Digg uses for icons and photos; Gearman, a massively distributed fork; and the new favorite toy, MemcacheDB: “Will be the biggest new kid on the block in scaling.” Initial tests on a laptop yielded 15,000 writes a second. The developer behind this took Berkeley DB and Memcache and brought them together.

Caching techniques: cache forever and explicitly expire, have a chain of responsibility. We had a generic expiration time on all objects at Digg. The problem is we have a lot of users and a lot of users that are inactive.  Chain-of-Responsibility pattern creates a chain: mysql, memcache, apc, PHP globals. You’re first going to hit globals, if it has it you’ll get it straight back, if not go to the next link in the chain, etc. Used at Facebook and Digg. If you’re caching fairly static content you can get away with a file based cache, if it’s something requested a bunch go with memcache, if it’s something like a topic in Digg we use apc.
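A rough sketch of the lookup chain described above, in Python rather than PHP. The layer names mirror the talk (globals, apc, memcache, mysql), but each layer here is just a dictionary, and `CacheLayer` is a hypothetical class, not Digg's or Facebook's code.

```python
class CacheLayer:
    """One link in the chain; a plain dict stands in for the real store."""

    def __init__(self, name, next_layer=None):
        self.name = name
        self.store = {}
        self.next_layer = next_layer

    def get(self, key):
        # Answer from this link if possible, otherwise ask the next one.
        if key in self.store:
            return self.store[key], self.name
        if self.next_layer is None:
            return None, None
        value, source = self.next_layer.get(key)
        if value is not None:
            # Remember the value so the next lookup stops at this link.
            self.store[key] = value
        return value, source

    def expire(self, key):
        # Explicit expiry: drop the key from every cache link, but never
        # from the last link, which stands in for the database.
        if self.next_layer is not None:
            self.store.pop(key, None)
            self.next_layer.expire(key)


mysql = CacheLayer("mysql")
mysql.store["user:1"] = {"name": "kevin"}
memcache = CacheLayer("memcache", mysql)
apc = CacheLayer("apc", memcache)
php_globals = CacheLayer("globals", apc)

first = php_globals.get("user:1")   # falls all the way through to mysql
second = php_globals.get("user:1")  # now answered by the first link
php_globals.expire("user:1")        # cache forever, expire explicitly
third = php_globals.get("user:1")   # falls through again after expiry
```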

Partition your data horizontally (rows a-f on one machine) and vertically (some columns on one table, some on another table). Partition horizontally when you have so much data you need to spread it across a lot of servers. Vertical partitioning: instead of altering tables, add a new table and put the new columns in it; this avoids downtime. Abstract your data access so that the partitioning details are hidden from the user.
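Here is one way the "rows a-f on one machine" idea and the hidden data-access layer might look. The letter ranges, host names, and the `UserStore` class are assumptions for illustration, and usernames are assumed to be non-empty; the talk does not say how Digg actually maps rows to servers.

```python
# Hypothetical shard map: first letter of the username picks the host.
SHARDS = [
    ("a", "f", "db1"),
    ("g", "m", "db2"),
    ("n", "s", "db3"),
    ("t", "z", "db4"),
]


def shard_for(username):
    first = username[0].lower()
    for low, high, host in SHARDS:
        if low <= first <= high:
            return host
    return "db-other"  # digits, punctuation, non-ASCII names


class UserStore:
    """Data-access layer: callers never see which host holds a row."""

    def __init__(self):
        self._hosts = {}  # host name -> {username: row}, stands in for servers

    def save(self, username, row):
        self._hosts.setdefault(shard_for(username), {})[username] = row

    def load(self, username):
        return self._hosts.get(shard_for(username), {}).get(username)


store = UserStore()
store.save("digg_fan", {"diggs": 12})
row = store.load("digg_fan")
```

Range-based maps like this one tend to produce uneven shards when names cluster on a few letters, which is one reason to keep the mapping behind a single function that can change later.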

Green badges at Digg are the bane of Joe’s existence. It’s the kind of problem both Twitter and Digg have: you take a message from one place and drop it in a bunch of other buckets. Kevin Rose has 40,000 followers. You can’t drop something into 40,000 buckets synchronously. 300,000 to 320,000 diggs a day. If the average person has 100 followers that’s 300,000,000 Diggs a day. The most active Diggers are the most followed Diggers. The idea of averages skews way out. “Not going to be 300 queries per second, 3,000 queries per second. 7gb of storage per day. 5tb of data across 50 to 60 servers so MySQL wasn’t going to work for us. That’s where memcachedb comes in.” The recommendation engine is a custom graph database from the R&D department and is eventually consistent. This is an example of the problems you run into at real big scale on a social website.

Andrew Turner, Mikel Maron – Trends and Technologies in Where 2.0

Thursday, September 18th, 2008

Notes from the Web 2.0 Expo – NY talk given by Andrew Turner and Mikel Maron.

As people are going out and gathering information on their own we’re collecting a lot of geo-aware data. This is becoming a really hot area. Nokia and TomTom just made big acquisitions. Every Web 2.0 service is starting to add location. You can start mining this information with tools like geocodr.

How do you start gathering them together? We started a company called geocommons. We’re taking this massive amount of data and trying to pull it all together. It’s an open database of freely available data with creative commons license. You can see where it came from and who posted it. You can search the data.

What about when your communities are supplying a lot of data? In Detroit the city is geocoding walking trails. With Hurricane season there are lots of people geocoding where shelters are. A local NYC company social lite is doing place marking with bars using mobile web. Android has a lot of applications which are innovating on the geo aware capabilities of the phone.

Mapvertising is one way in which people are trying to make money in this space. But it’s hard. You don’t want to do a search for a romantic restaurant near you and get back a Hooters advertisement.

Once people are sharing all of this there is a problem of privacy. Flickr is looking at casual privacy where you can set who is allowed to see where your photos were taken. Fire eagle is a location brokering system. If you trust Yahoo! they can be the trusted holder of your location. You can specify which sites get which granularity (only zip code, for example) of knowledge of your data.

NeoCartography sites like EveryBlock are trying to focus on the data as opposed to the street. You can look at and understand the area based on data. OpenCycleRoute allows you to re-render a map with the best routes for bikes.

We’re launching GeoCommons Maker in a couple of weeks which allows you to create proper maps.

Burning Man Experiment

You may think it’s a bunch of naked hippies in the desert blowing things up, and it is, but it’s a whole lot more than that. You show up in a desert and within a week you have a miniature city. It provides a canvas for trying out upcoming geocoding technologies.

We collected over 100 gigabytes of data over the week. These are early results. Why is this important for 2.0 Expo? This is a look at what these technologies can enable. Burning Man Earth was a really interesting iteration in Where 2.0.

We took remote sensing data every day. We used pictearth.com, diydrones.ning.com, openaerial.com. We got a small plane with a camera. We were gifted 200 Gallons of Fuel. We took a flight path every morning. Really cool pictures of how the event evolves day after day.

Burning Man Map

Processing with ERMapper, ESRI, Photoshop, Sweat. You also are recording with a GPS device which gives you where each photo is centered. Takes a lot of sweat to get everything lined up.

Underlying all of the projects we did is the GeoDjango platform. GeoDjango provides facilities for doing mapping. You can do interesting queries on geographic data pretty much for free.  We took the PDF view of the city map, rasterized, and did rectification. Used ESRI, WMS, and Tiling software (TileCache). OpenLayers is an opensource mapping library in Javascript, has editing tools which allow you to draw over a rectified map.

In the Future we’re going to turn this into a social application.

OpenStreetMap (OSM) is a free map for the world, like Wikipedia for maps. We export from GeoDjango to OSM XML. Import into OSM through its REST API. Mapnik + mod_tile. The output is a tiled map. Flickr asked if they could take our tiles and use them for people to tag photos. It was really easy to do because we used basic tiles.
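For readers unfamiliar with tiled maps, the sketch below shows the tile-numbering arithmetic commonly used for OpenStreetMap-style "slippy map" tiles: a latitude/longitude pair and a zoom level map to an x/y tile index. It is offered as background on what "basic tiles" usually means, assuming the standard Web Mercator scheme; the talk does not spell out the exact scheme the Burning Man maps used, and `tile_for` is a hypothetical helper name.

```python
import math


def tile_for(lat_deg, lon_deg, zoom):
    """Return the (x, y) tile index containing a point at a given zoom level."""
    n = 2 ** zoom  # the map is an n-by-n grid of tiles at this zoom
    lat_rad = math.radians(lat_deg)
    # Longitude maps linearly onto the x axis.
    x = int((lon_deg + 180.0) / 360.0 * n)
    # Latitude goes through the Mercator projection before scaling.
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


# At zoom 0 the whole world is a single tile.
whole_world = tile_for(40.78, -119.2, 0)
# A point in the Nevada desert falls in the north-west tile at zoom 1.
north_west = tile_for(40.78, -119.2, 1)
```

Because every consumer can compute the same index from a coordinate, tiles rendered once can be reused by another site, which fits the Flickr anecdote above.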

We need to start making our map tiles and our geotags time aware because the earth changes. By using OpenStreetMap you can get Garmin maps for free. There’s a freeware product called cGPSMapper. We used Garmin Rinos because they have radio built in so you can see where your friends are. At Burning Man it was incredibly useful. We also did vehicle tracking by sending packets over ham radio and APRS. The signal is picked up by a digipeater which sends the data on. Some of them take that positioning data and post it to the internet.

OpenViewProject.org – we were at WhereCamp at Yahoo and hacked Google street view’s data. Google sent a cease and desist. So a friend of mine bought a lot of gear and a tricycle so that he could do it himself.

Gigapans – a gigapan is a gigapixel image. Greater than 500 megapixels. NASA designed a little, sub $200 robot that captures panoramas and their software stitches it together. There’s a site Gigapan that allows you to view these massive photos and zoom in.

Kite Aerial Photography – you can script photos for Canon cameras using their developer kit.

Google Earth & SketchUp models – Andrew Johnstone would take photos of art and texture models made in SketchUp.

Panel Discussion – Building in the Clouds: Scaling Web 2.0

Thursday, September 18th, 2008

Panel: Jason Hoffman (Joyent), Alistair Croll (Bitcurrent), Alex Barnett (From Bungee Labs to Intuit), Dwight Merriman (10Gen), Jinesh Varia (AWS), Pete Koomen (Google)

Panel session driven by Q&A.

Q) Decision between a component centric cloud and a service centric cloud? In a component centric I need to add instances to my app cluster (i.e. AWS), and in a service centric I write for a specific framework that scales itself (i.e. AppEngine). When does it make sense to focus on each?

A) Hoffman: I think they’ve already converged. It depends on the situation and you do both. The web app tiering has long been dead. You’re already silo’ing your assets. People are going to look at a given functionality in their site and ask what’s the service behind it?

Koomen: With App Engine it’s designed to handle low latency web applications.

Varia: Component clouds are great for flexibility. As the abstractions increase you lose flexibility and you also face lock-in on a technology stack.

Barnett: Scaling for what and why? How much up front consideration do start-ups need to put into becoming scalable? If you’ve only got a set of resources that isn’t infinite how do you face it? The nature and the type of the application will have fundamental implications to the underlying design.

Hoffman: Most web apps don’t have to scale in any reasonable amount of time. Another scaling issue is when you start out bigger and you don’t get enough traffic and have to scale down.

Q) Centralized computing & distributed computing. There is tension between centralized and distributed computing. Google has been buying thousands of NetScalers and just this morning Amazon announced the cloud delivery service.

A) Merriman: Interesting fact that CDNs are one of the first forms of cloud computing. It’s an easy way to distribute content. Definitely use CDN for static.

Koomen: Scaling is about reducing the constant factor. Has to do with minimizing the amount of work you’re doing in the central server. Whether it’s in the CDN or the client side. It’s about a mentality of reducing what you’re doing on every request.

Hoffman: Amazon was smart about coming out with S3 before coming out with EC2. If you’re dealing with datasets less than a terabyte in size.

Varia: We have been listening a lot. From a scalability perspective many people needed data closer to their customers. Amazon is opening a CDN in 3 continents where the static data will be available from S3 with lower latencies and higher data transfer rates. Customers running RIAs feel it is key to serve content faster.

Q) How much can the edge help?

Hoffman: Outside of serving static content like images the edge doesn’t do anything.

Merriman: I don’t know that I agree with that because if you’re serving data to Japan.

Hoffman: WAN optimization and network optimization is quite different than edge caching.

Q) How do you measure capacity and performance? What are the metrics you look at?

Koomen: Google cares a lot about CPU and latency. We can scale disk easily.

Hoffman: I think that’s the opposite end. There are things that take up space or move space. Disk space, CPU space, and network space. Then there’s the moving to and from these things. Most people in the real world are not coding against the CPU or CPU bound in a web app. Nobody’s writing web apps that saturate the bandwidth that comes out of a single server. It takes a long time to fill up a terabyte. What people need is memory and better disk I/O. People still use relational databases. Disk I/O is the main thing.

Barnett: We also worry a lot about the end user experience. We’ve instrumented the AJAX library coming down to track every mouse click and interaction that an application has at a very granular level. You’re able to measure, in milliseconds, every single click and the latency on web service calls.

Hoffman: There doesn’t currently exist tooling to take end-user experience and feed that all the way back to capacity planning.

Merriman: We had to serve 10 – 20 billion ads per day. There’s a lot of CPU involved in picking which ads to serve. Other issue was just the database. “Have you seen this ad before? How many times?” Lots of data you access in real time and on the back-end on event processing. We looked at CPU a lot and I/O utilization on the database servers.

Varia: At Amazon, metrics is the key. From individual developer, to business, to our whole organization. From a developer we measure in time byte hours which is how much data that person is storing and how it grows. From S3 we measure the number of objects stored (22 billion objects stored) and the number of transactions. We peak at 50,000 transactions per second. We stay ahead of the curve. On the business side we need to understand our segmentation of large, medium, and small businesses.

Barnett: It’s interesting that when we charge for services on a utility model we

Koomen: We’re not going to be able to prevent people from taking out cloud services if they write bad code. So it’s important for us to be able to figure out where the problems exist and bubble that up to the user so they’re not making bad decisions.

Q) How do you guys deal with one rogue app?

Varia: Animoto is a very cool web 2.0 application where you upload your photos and music tracks and it creates a really cool video out of them. They created a Facebook app and went from 25,000 users total to adding 25,000 users every hour. They scaled from 50 servers to 5000 servers in 2 days. They were able to do this because they are built on a cloud platform. They scaled it down during the night time to save on cost. Some of these applications are bursting, no doubt. On an aggregate level the curve is pretty smooth. Amazon takes tremendous pride in figuring out how to add servers and services.

We have certain limits which prevent developers from starting 1000 instances. You are capped at 20 initially. If you want more you have to talk to us. There are security and safety mechanisms in place. If a business has a valid business case we’ll flip the switch.

Hoffman: If you have to spin up new virtual machines to handle traffic bursts you’re going to miss the burst.

Koomen: We deal with bursts like that by dealing with every request agnostically. To address the question from our side on what you do to prevent the users from exploiting a system. We’ve got quotas that measure what individual applications can consume and some knobs to turn.
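Both answers to the rogue-app question come down to per-customer limits that an operator can raise. A minimal sketch of that idea follows; `QuotaTracker` and the app name are hypothetical, the cap of 20 is the figure Varia quotes for initial instances, and neither AWS's nor App Engine's real quota systems are shown here.

```python
class QuotaTracker:
    """Tracks how much of a capped resource each application has used."""

    def __init__(self, limits):
        self._limits = dict(limits)
        self._used = {}

    def try_consume(self, app, amount=1):
        # Refuse the request if it would push the app past its limit.
        used = self._used.get(app, 0)
        if used + amount > self._limits.get(app, 0):
            return False
        self._used[app] = used + amount
        return True

    def raise_limit(self, app, new_limit):
        # The "knob": an operator lifts the cap for a valid business case.
        self._limits[app] = new_limit


tracker = QuotaTracker({"startup": 20})
results = [tracker.try_consume("startup") for _ in range(21)]
tracker.raise_limit("startup", 100)
after_raise = tracker.try_consume("startup")
```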