Photosynth

lodi · Post by **lodi** » Fri Jun 08, 2007 4:00 am

(This doesn't use ogre but I thought this link was just unbelievable with respect to what they show and applications to 3d scanning)

My buddy forwarded me this link to a Microsoft Research project that looks like it's 'straight out of Minority Report':

http://youtube.com/watch?v=s-DqZ8jAmv0

(you can skip everything after 7:40; it's a bmw commercial)

and then

http://labs.live.com/photosynth/view.ht ... index1.sxs

(ie6+ only)

Frenetic · Post by **Frenetic** » Fri Jun 08, 2007 7:46 am

Woah.

An impressive extension and usage of current image-processing algorithms! How many more hops, skips and jumps before we get an AI that can look at something and go "That's a house", "That's a street with a red car in it", "That's an inferior fleshbag ambulating on his filthy appendages toward an intoxicating substance repository" ... and then link all the information up, become a hive-mind, and destroy all humanity.

lodi wrote:(ie6+ only)

But aye, there's the rub! Let me know when I can see some cross-platform source code.

Wretched_Wyx · Post by **Wretched_Wyx** » Fri Jun 08, 2007 9:07 am

Photosynth is pretty impressive looking indeed, but definitely not "cutting edge" stuff. Really that's all old tech, with a pretty face. Not to knock it or anything, I'd love to get my hands on that and toy around a bit.

As far as "cutting edge", without saying too much and getting men in black suits at my door within hours of posting this- I can say I know a guy who knows a guy who once saw something out of the corner of his eye while walking past this room... To make a long story short, the aforementioned (from Frenetic's post) stuff has actually been done for years. It was already being messed with in the hobbyist realm, as well as being used in experimental "military applications". With the advent of the terrorist attack of 9/11, that kind of stuff was ushered from the "misc applications" area, to the forefront of intelligent surveillance software.

Perhaps you've caught a show on the surveillance systems of Las Vegas casinos? Well, that software, as non-cutting-edge as it is, does some crazy stuff as well. Every person that walks into a casino gets face printed, and at any given time can be tracked to a scary degree of accuracy. And with all of this being done by AI, all the person sitting at the computer needs to do is basically hit the button that says "track".

Now, taking all this into consideration, and knowing that the stuff we actually see on BRAND NAME programs or the internet is far behind the "black ops" kinda technology the government (erm, the US government at least) always has cooking... It scares me to think of THAT kind of stuff.

The way I see it, it's not a matter of how many hops and skips until we get some crazy AI that knows what we drink, how many times we wipe, etc... But a matter of when all the existing stuff is put together and applied in this manner. Which really, is all Photosynth is, a bunch of current and past gen techniques and algorithms being smashed together to make it what it is.

Whoa. I was rambling there.

beaugard · Post by **beaugard** » Fri Jun 08, 2007 10:07 pm

If this actually works for arbitrary collections of photos, I am very impressed. The feature recognition is old news, that's true. The real challenge is fitting camera positions and 3d positions of features simultaneously. The solution space must be absolutely enormous! I really wonder what kind of strategies they use to cover it all.
The second problem would be noise in the dataset and/or an unbalanced dataset. Maybe these collections are especially suited - weighted to cover all the space without focusing too much on a specific monument, etc, but for an arbitrary collection... Well, I suppose they discard photos that do not share enough features, but anyway.

I doubt this particular technology will help much in deciphering the semantics of a picture. The kind of software that guesses the position of a face, for example, does not need this at all. In fact, in most cases of surveillance you already know what to look for, so the problem is reduced to matching this "known" to the data. In the case of Photosynth, nothing is known beforehand.

Fredz · Post by **Fredz** » Mon Jun 11, 2007 11:51 pm

beaugard wrote:The feature recognition is old news, that's true.

It's not completely old news. Corner detectors date back to the early eighties (Moravec 1980), but there has been a lot of work on the subject since then, and new algorithms are found and refined each year. The one used in Photosynth is SIFT, which was published by Lowe in 2004.

beaugard wrote:The real challenge is fitting camera positions and 3d positions of features simultaneously. The solution space must be absolutely enormous! I really wonder what kind of strategies they use to cover it all. The second problem would be noise in the dataset and/or an unbalanced dataset.

The maths involved in this field have been known for quite a long time too (since the early nineties), but knowledge is still evolving on that front as well. The solution space is quite easily solved with least-squares techniques (8 point algorithm), and noise can be accounted for with robust iterative solutions (RANSAC). There are several open source libraries you can use to produce the same results without too much effort (VXL, OpenCV, Gandalf, RAVL, etc.).
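To make the RANSAC idea concrete, here is a minimal sketch in Python. It fits a 2D line rather than a fundamental matrix, purely to keep the example short; the function names, the iteration count, and the inlier threshold are illustrative assumptions, not anything taken from Photosynth or from the libraries listed above.

```python
import random


def fit_line(p, q):
    """Line a*x + b*y + c = 0 through p and q, scaled so a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0.0:
        return None  # degenerate sample: both points coincide
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)


def ransac_line(points, n_iters=200, threshold=0.1, seed=0):
    """Repeatedly fit a line to a random minimal sample (2 points) and
    keep the candidate that the largest number of points agree with."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        p, q = rng.sample(points, 2)
        model = fit_line(p, q)
        if model is None:
            continue
        a, b, c = model
        # With a^2 + b^2 = 1, |a*x + b*y + c| is the point-to-line distance.
        inliers = [pt for pt in points
                   if abs(a * pt[0] + b * pt[1] + c) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# 20 points on y = 2x + 1 plus 5 gross outliers (made-up data).
good = [(float(x), 2.0 * x + 1.0) for x in range(20)]
bad = [(3.0, 40.0), (7.0, -12.0), (12.0, 3.0), (15.0, 60.0), (18.0, 0.0)]
model, inliers = ransac_line(good + bad)
print(len(inliers), "inliers out of", len(good + bad))
```

The same loop applies to camera geometry by swapping the model: the minimal sample becomes a handful of point correspondences and the distance becomes an epipolar error. A plain least-squares fit over all 25 points would be pulled off the line by the outliers, which is the noise problem RANSAC is meant to address.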

In my opinion, the real innovation brought by this software is the concept itself, which is quite different from classical research goals (robotic vision and automatic 3d reconstruction instead of 3D photo-montage). This and the fact that they did succeed in creating an application oriented towards end-users and not only researchers.

beaugard · Post by **beaugard** » Tue Jun 12, 2007 11:08 am

@Fredz
You are probably right that the biggest innovation is the interface. I still maintain that the biggest technical challenge is solving ~100 camera positions and thousands(?) of feature positions without any prior knowledge.

Fredz wrote:The solution space is quite easily solved with least square techniques (8 point algorithm)

C'mon, easily solved?! What about "theoretically possible" to solve? Solving two uncalibrated cameras might be easy (although I have heard bad things about the 8 point algorithm with noisy data), but this is something entirely different.

See, my point is not that there must be a groundbreaking mathematical discovery behind this. I just haven't seen it done with such a lot of material (many camera positions) and such noisy data (cell phone cameras!). The solution space is all possible positions/directions for ~100 cameras! Most similar applications I have seen use two cameras, or known relative camera positions, or at least one object photographed from different angles or at least pictures taken with the same camera...

beaugard · Post by **beaugard** » Tue Jun 12, 2007 11:53 am

Correction: the solution space is of course all positions/orientations of ~100 cameras and all positions of thousands of features (with less than 100 features visible for most cameras).
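As a rough back-of-the-envelope check on the size of that solution space, the unknowns and the available equations can be counted as below. The concrete figures (6 pose parameters per camera, 3000 features, 80 features seen per camera) are assumptions picked to fit the "~100 cameras, thousands of features, fewer than 100 visible per camera" description; they are not numbers from the project.

```python
def count_unknowns(n_cameras, n_points, params_per_camera=6):
    """Unknowns in a joint fit of cameras and features.

    Assumes 6 parameters per camera (3 for position, 3 for orientation;
    lens properties are left out) and 3 per feature (its x, y, z).
    """
    return params_per_camera * n_cameras + 3 * n_points


def count_equations(n_cameras, features_per_camera):
    """Each feature seen in a photo gives two equations (image x and y)."""
    return 2 * n_cameras * features_per_camera


unknowns = count_unknowns(n_cameras=100, n_points=3000)
equations = count_equations(n_cameras=100, features_per_camera=80)
print(unknowns, "unknowns vs", equations, "equations")
```

Under these assumptions there are more equations than unknowns, so the fit is constrained on paper; the count says nothing about how hard it is to find a good solution from noisy data.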

Fredz · Post by **Fredz** » Tue Jun 12, 2007 8:13 pm

Having 100 cameras doesn't complicate the problem at all: Faugeras proved in 1995 that the system is overdetermined with more than three cameras.

You need only 6 point correspondences in each of 3 images to obtain the extrinsic parameters of the 3 cameras up to a scale factor. For 100 cameras, you just need to iterate the same algorithm over groups of 3 images, without needing any information about the other images.

Fredz · Post by **Fredz** » Tue Jun 12, 2007 8:16 pm

beaugard wrote:Solving two uncalibrated cameras might be easy (although I have heard bad things about the 8 point algorithm with noisy data), but this is something entirely different.

Read "In Defense of the Eight-Point Algorithm" by Hartley for a very good solution to this problem. With the normalization step it describes, the algorithm gives results almost as good as the best iterative solutions.
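For reference, the normalization Hartley proposes is a simple preprocessing of the image points: shift them so their centroid sits at the origin, then scale them so their average distance from the origin is sqrt(2). A minimal sketch follows, with a hypothetical function name and made-up pixel coordinates; the eight-point solve itself needs an SVD and is left out.

```python
import math


def normalize_points(points):
    """Hartley-style normalization of 2D image points.

    Returns the normalized points and the 3x3 transform T (acting on
    homogeneous coordinates) that was applied to them.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_dist = sum(math.hypot(x - cx, y - cy) for x, y in points) / n
    if mean_dist == 0.0:
        raise ValueError("all points coincide; cannot normalize")
    s = math.sqrt(2.0) / mean_dist
    T = [[s, 0.0, -s * cx],
         [0.0, s, -s * cy],
         [0.0, 0.0, 1.0]]
    normalized = [(s * (x - cx), s * (y - cy)) for x, y in points]
    return normalized, T


# Made-up pixel coordinates, e.g. from a 640x480 photo.
pixels = [(100.0, 200.0), (400.0, 250.0), (320.0, 480.0), (50.0, 60.0),
          (600.0, 30.0), (210.0, 330.0), (500.0, 410.0), (15.0, 440.0)]
normalized, T = normalize_points(pixels)
```

Each image is normalized separately, the eight-point algorithm is run on the normalized coordinates, and the resulting matrix is mapped back to pixel coordinates using the two transforms. The point of the step is numerical: raw pixel values in the hundreds make the linear system badly conditioned.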

beaugard · Post by **beaugard** » Wed Jun 13, 2007 12:06 am

hehe... you've directed me to some pretty interesting papers. Thanks.
Once again, I do not mean it is mathematically a difficult problem. The problem is handling so much data with limited processor time.

I found the paper, actually.
http://phototour.cs.washington.edu/Photo_Tourism.pdf

It is not very technical, but there are some nice facts. As I guessed, the complexity increases alarmingly fast with the number of cameras. For 82 cameras they have a run-time of a few hours, while for 597 photos the run-time was two weeks. So, whereas the mathematical solution is not more difficult for more cameras, the engineering problem (making it run in a reasonable time) is worse.
They also seem to have some nice trick for estimating K (properties of the lens).
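To put a rough number on "alarmingly fast", one can ask what power law would connect those two run-times. This is only a sketch: two data points cannot establish a trend, and since "a few hours" is vague, the values of 2, 3 and 5 hours tried below are assumptions rather than figures from the paper.

```python
import math


def scaling_exponent(n_small, hours_small, n_large, hours_large):
    """Exponent k for which runtime ~ n**k passes through both points."""
    return math.log(hours_large / hours_small) / math.log(n_large / n_small)


two_weeks = 14 * 24  # hours
for few_hours in (2, 3, 5):  # assumed readings of "a few hours"
    k = scaling_exponent(82, few_hours, 597, two_weeks)
    print(f"{few_hours} h for 82 photos -> exponent {k:.2f}")
```

For all three assumed values the exponent lands between 2 and 3, i.e. somewhere between quadratic and cubic growth in the number of photos.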

lodi · Post by **lodi** » Wed Jun 13, 2007 5:22 am

(I messed it up in my original post but the project is called 'seadragon'. 'Photosynth' is just the web version)

I'm not too familiar with the state of the art in this field, but I suppose what blew me away wasn't just the part about stitching together photographs. I was most impressed with how "slick" the whole application was in general. The performance was fluid on multi-gigabyte data sets, the zooming was perfect, and the part about zooming into the ad to see horsepower characteristics and such was pure outside-the-box thinking. I suppose it's functionally equivalent to clicking a link and getting a pop-up or a new tab, and yet it's so visual and easy to skim through. Then again, I'm used to Adobe Acrobat taking half a minute to load and taking seconds to scroll through pages in a document (never mind changing zoom levels), so maybe I'm just easy to impress :-)

JohnJ · Post by **JohnJ** » Wed Jun 13, 2007 6:12 am

lodi wrote:The performance was fluid on multi-gigabyte data sets, the zooming was perfect, and the part about zooming into the ad to see horsepower characteristics and such was pure outside-the-box thinking.

I agree, it's quite impressive, whether or not this technology is old. Just like making a game, success is not a matter of how advanced your technology is, but how well you put together what's available. Results is the keyword here.