CameraTransform: A Python package for perspective corrections and image mapping

Camera images and video recordings are simple and non-invasive tools to investigate animals in their natural habitat. Quantitative evaluations, however, often require an exact reconstruction of object positions, sizes, and distances in the image. Here, we provide an open source software package to perform such calculations. Our approach allows the user to correct for perspective distortion, transform images to “bird’s-eye” view projections, or transform image coordinates to real-world coordinates and vice versa. The extrinsic camera parameters that are necessary to perform such image corrections and transformations (elevation, tilt/roll angle, and heading of the camera) are obtained from the image using contextual information such as a visible horizon, GPS coordinates of landmarks, known object sizes, or images of the same object obtained from different viewing angles. All mathematical operations are implemented in the Python package CameraTransform. The performance of the implementation is evaluated using computer-generated synthetic images with known camera parameters. Moreover, we test our algorithm on images of emperor penguin colonies, and demonstrate that the camera tilt and roll angles can be estimated with an error of less than one degree, and the camera elevation with an error of less than 5%. The CameraTransform software package simplifies camera matrix-based image transformations and the extraction of quantitative image information. Extensive documentation and usage examples in an ecological context are provided at http://cameratransform.readthedocs.io.


Introduction
Optical recordings such as on-demand images from camera traps, continuous time-lapse images, or video recordings are a widely used tool in ecology [2,7,1]. While such recordings are useful for counting animals and estimating abundances [5], they inherently contain perspective distortions that make it difficult to measure positions and distances. To correct for such distortions and to map image points to real-world positions, it is paramount to know certain camera parameters. These include the geographic camera position relative to landmarks in the scenery, the camera height, tilt/roll angle and heading. These parameters are often difficult or impossible to evaluate in the field at the time of the recording, but they can be reconstructed afterwards if the real-world coordinates of prominent features in the images are known. The mathematical procedure behind this reconstruction is based on simple linear algebra, but the steps to apply the underlying matrix operations to image data can be somewhat involved.
In this article we present the Python package CameraTransform, which was developed to facilitate post-recording calibration based on single (not stereo) images. CameraTransform provides various tools to estimate the camera parameters from features present in the image, and transforms point coordinates in the image to real-world or to geographic coordinates. We explain the mathematical details of the calibration and transformation, present calibration examples and provide an analysis of the uncertainty of the procedures.

Camera Matrix
All information about the mapping of real-world points to image points is stored in a camera matrix. The camera matrix is expressed in projective coordinates, and can be split into two parts: the intrinsic matrix and the extrinsic matrix [3]. The intrinsic matrix depends on the camera sensor and lens; the extrinsic matrix depends on the camera's position and orientation.

Projective coordinates
Projective coordinates, also known as homogeneous coordinates, are used to represent projective transformations as matrix multiplications [6]. They are a mathematical trick that extends the vector representation of a point with an additional entry. This entry defaults to 1, and all scalar multiples of a vector are considered equal: for example, the point (5,7) can be represented by the tuple of projective coordinates (5,7,1) or (10,14,2), or in general by (5s,7s,s). The scalar s need not be an integer. Projective coordinates allow us to write the camera projection y as

$$ y = C \cdot x, $$

where x specifies the point in the 3D world, which is transformed with the camera matrix C to obtain the point in the camera image y.
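The equivalence of scalar multiples can be checked numerically. The following is a minimal sketch using NumPy; the helper names `to_projective` and `from_projective` are hypothetical and chosen for this illustration only.

```python
import numpy as np

def to_projective(point, s=1.0):
    """Scale the point by s and append s as the additional entry."""
    return np.append(np.asarray(point, dtype=float) * s, s)

def from_projective(vector):
    """Divide by the additional (last) entry and drop it."""
    vector = np.asarray(vector, dtype=float)
    return vector[:-1] / vector[-1]

a = to_projective((5, 7))          # (5, 7, 1)
b = to_projective((5, 7), s=2.0)   # (10, 14, 2)
c = to_projective((5, 7), s=0.5)   # (2.5, 3.5, 0.5): s need not be an integer

# All three tuples describe the same 2D point.
same_point = all(np.allclose(from_projective(v), (5, 7)) for v in (a, b, c))
```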

Intrinsic parameters
To compute the intrinsic matrix entries, we need to know the focal length f of the camera in mm, the sensor dimensions (w_sensor × h_sensor) in mm, and the image dimensions (w_image × h_image) in pixels. The intrinsic matrix entries are then the effective focal length f_pix and the centre of the image (w_image/2, h_image/2) according to

$$ C_\mathrm{intr.} = \begin{pmatrix} f_\mathrm{pix} & 0 & w_\mathrm{image}/2 & 0 \\ 0 & f_\mathrm{pix} & h_\mathrm{image}/2 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad f_\mathrm{pix} = \frac{f}{w_\mathrm{sensor}} \cdot w_\mathrm{image}. $$

Here, the diagonal elements account for the rescaling from pixels in the image to a position in mm on the chip. The off-diagonal elements represent an offset: the origin of the image is at the top left corner, whereas the origin of the chip coordinates is at the centre of the chip.
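As a numerical illustration, the sketch below builds an intrinsic matrix for the camera used later in the synthetic tests (14 mm focal length, 17.3 mm sensor width, 4608×2592 px). It assumes square pixels, so that the same f_pix applies to both image axes, and a 3×4 layout that matches a 4×4 extrinsic matrix; the function name `intrinsic_matrix` is hypothetical.

```python
import numpy as np

def intrinsic_matrix(f_mm, sensor_w_mm, image_w_px, image_h_px):
    """3x4 intrinsic matrix in projective coordinates (square pixels assumed)."""
    f_pix = f_mm / sensor_w_mm * image_w_px   # effective focal length in pixels
    return np.array([[f_pix, 0.0, image_w_px / 2, 0.0],
                     [0.0, f_pix, image_h_px / 2, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

C_intr = intrinsic_matrix(14, 17.3, 4608, 2592)
f_pix = C_intr[0, 0]          # roughly 3729 px
centre = C_intr[:2, 2]        # image centre (2304, 1296)
```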

Extrinsic parameters
To compute the extrinsic matrix, we need to know the offset (x,y,z) of the camera relative to an arbitrary fixed real-world reference point (0,0,0) in the three spatial directions. Customarily, the reference point is placed on the ground, so that z is the height of the camera above ground. Similarly, the x,y plane of our coordinate system is customarily the horizontal plane.
We also need to know three angles: the tilt angle α_tilt, which specifies how much the camera is tilted against the horizontal, the heading angle α_heading, which specifies the direction relative to the y-direction in which the camera is heading, and the roll angle α_roll, which specifies how the image is rotated (see Fig. 1).
To compute the extrinsic camera matrix, we first need the three rotation matrices (one for each of the angles α_tilt, α_roll and α_heading) and the translation vector t. The extrinsic camera matrix then consists of the combined 3×3 rotation matrix R and the 3×1 translation vector t side by side, extended to a 4×4 matrix in projective coordinates.
The final camera matrix C is the product of the intrinsic and the extrinsic camera matrix.
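A minimal sketch of this construction follows. The explicit rotation matrices are not reproduced in this text, so the sign and ordering conventions used here are assumptions of the sketch and not necessarily those of CameraTransform: a tilt of 0° looks straight down and 90° looks at the horizon, the heading is measured clockwise from the +y direction when seen from above, the camera axes are x right, y down, z along the optical axis, and the rotations are applied in the order heading, tilt, roll. All function names are hypothetical.

```python
import numpy as np

def intrinsic_matrix(f_mm, sensor_w_mm, image_w_px, image_h_px):
    """3x4 intrinsic matrix (square pixels assumed)."""
    f_pix = f_mm / sensor_w_mm * image_w_px
    return np.array([[f_pix, 0.0, image_w_px / 2, 0.0],
                     [0.0, f_pix, image_h_px / 2, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

def rotation_matrix(tilt_deg, heading_deg=0.0, roll_deg=0.0):
    """World-to-camera rotation under the assumed conventions."""
    a, b, r = np.deg2rad([tilt_deg, heading_deg, roll_deg])
    R_tilt = np.array([[1, 0, 0],
                       [0, -np.cos(a), -np.sin(a)],
                       [0, np.sin(a), -np.cos(a)]])
    R_heading = np.array([[np.cos(b), -np.sin(b), 0],
                          [np.sin(b), np.cos(b), 0],
                          [0, 0, 1]])
    R_roll = np.array([[np.cos(r), np.sin(r), 0],
                       [-np.sin(r), np.cos(r), 0],
                       [0, 0, 1]])
    return R_roll @ R_tilt @ R_heading

def extrinsic_matrix(position, tilt_deg, heading_deg=0.0, roll_deg=0.0):
    """4x4 matrix [R | t; 0 0 0 1] with t = -R * camera position."""
    R = rotation_matrix(tilt_deg, heading_deg, roll_deg)
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -(R @ np.asarray(position, dtype=float))
    return E

C_intr = intrinsic_matrix(14, 17.3, 4608, 2592)
C_extr = extrinsic_matrix(position=(0, 0, 20), tilt_deg=80)
C = C_intr @ C_extr   # final 3x4 camera matrix
```

With this choice the translation places the camera position at the origin of the camera frame, so `C_extr` applied to the camera position in projective coordinates gives (0, 0, 0, 1).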

Projecting from the World to the Camera
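Based on the camera matrix C, it is straightforward to see how a real-world point corresponds to a pixel of the acquired image. First, the real-world point p_world = (x_1, x_2, x_3) is written in projective coordinates:

$$ \tilde{p}_\mathrm{world} = (x_1, x_2, x_3, 1)^T, $$

where \tilde{p} denotes the vector p in projective coordinates. Second, the point \tilde{p}_world can be projected to the image coordinates:

$$ \tilde{p}_\mathrm{im} = C \cdot \tilde{p}_\mathrm{world}. $$

Finally, the point \tilde{p}_im is converted back from projective coordinates (with three entries) to "conventional" coordinates p_im (with two entries) by dividing by the additional scaling factor s (which is the third entry of \tilde{p}_im):

$$ p_\mathrm{im} = \left( \tilde{p}_{\mathrm{im},1} / \tilde{p}_{\mathrm{im},3},\; \tilde{p}_{\mathrm{im},2} / \tilde{p}_{\mathrm{im},3} \right), $$

where the subscript denotes the entry of the vector \tilde{p}_im.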
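The three steps can be sketched in a few lines of NumPy. The camera matrix used here is built with an assumed convention (tilt 0° looks straight down, 90° looks at the horizon along +y, no roll or heading, camera above the origin), which is an assumption of this illustration; `camera_matrix` and `project_to_image` are hypothetical names.

```python
import numpy as np

def camera_matrix(height, tilt_deg, f_mm=14, sensor_w_mm=17.3, image=(4608, 2592)):
    """3x4 camera matrix for a camera above the origin (assumed convention)."""
    f_pix = f_mm / sensor_w_mm * image[0]
    a = np.deg2rad(tilt_deg)
    K = np.array([[f_pix, 0, image[0] / 2, 0],
                  [0, f_pix, image[1] / 2, 0],
                  [0, 0, 1, 0]])
    R = np.array([[1, 0, 0],
                  [0, -np.cos(a), -np.sin(a)],
                  [0, np.sin(a), -np.cos(a)]])
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -(R @ np.array([0.0, 0.0, height]))
    return K @ E

def project_to_image(C, p_world):
    p_tilde = np.append(np.asarray(p_world, dtype=float), 1.0)  # step 1: projective coordinates
    p_im_tilde = C @ p_tilde                                    # step 2: apply the camera matrix
    return p_im_tilde[:2] / p_im_tilde[2]                       # step 3: divide by s (third entry)

C = camera_matrix(height=20, tilt_deg=80)
d = 20 * np.tan(np.deg2rad(80))                   # where the optical axis meets the ground
optical_axis_hit = project_to_image(C, [0, d, 0])  # lands on the image centre
foot_px = project_to_image(C, [0, 100, 0])         # base of a 1 m object at 100 m
head_px = project_to_image(C, [0, 100, 1])         # its top, higher up in the image
```

Because the image origin is at the top left corner, the top of the object has a smaller vertical pixel coordinate than its base.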

Projecting from the Camera back to real-world coordinates
While projecting from the 3D real world to the 2D image is a straightforward matrix multiplication, projecting from the image back to the real world is more difficult. As the information of the third dimension is lost during the transformation from the real world to the image, there exists no unique back-transformation. An additional constraint is needed to transform a point back to the 3D world, e.g. one of the 3D coordinates must be fixed. For example, if the real-world point p_world has a fixed x_2 coordinate (for example a mural painting on a vertical wall that is aligned in the y-direction of the coordinate system) and the image coordinates y_1 and y_2 are given, the back-transformation can be performed as follows:

$$ s \begin{pmatrix} x_1 \\ x_3 \\ 1 \end{pmatrix} = \left( C \cdot \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & x_2 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right)^{-1} \begin{pmatrix} y_1 \\ y_2 \\ 1 \end{pmatrix}. $$

This means that the information about the fixed 3D coordinate has to be incorporated in the camera matrix. The inverse of the resulting 3×3 matrix, when multiplied with the image point in projective coordinates, gives the unknown x_1 and x_3 entries of the real-world 3D point. After rescaling the vector entries (division by s), the known x_2 value is added to the vector to retrieve the real-world coordinates of the 3D point p_world.
The same approach can be used with fixed x_1 coordinates or, more relevant for many applications, with fixed x_3 coordinates (i.e. objects on a level surface are imaged) (see appendix A).
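A sketch of the back-projection for an arbitrary fixed coordinate is given below. It folds the fixed coordinate into the camera matrix by multiplying with a 4×3 matrix, inverts the resulting 3×3 matrix, and rescales. The camera convention (tilt 0° looks straight down, 90° looks at the horizon along +y) and the names `camera_matrix`, `project` and `back_project` are assumptions of this illustration.

```python
import numpy as np

def camera_matrix(height, tilt_deg, f_mm=14, sensor_w_mm=17.3, image=(4608, 2592)):
    """3x4 camera matrix for a camera above the origin (assumed convention)."""
    f_pix = f_mm / sensor_w_mm * image[0]
    a = np.deg2rad(tilt_deg)
    K = np.array([[f_pix, 0, image[0] / 2, 0],
                  [0, f_pix, image[1] / 2, 0],
                  [0, 0, 1, 0]])
    R = np.array([[1, 0, 0],
                  [0, -np.cos(a), -np.sin(a)],
                  [0, np.sin(a), -np.cos(a)]])
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -(R @ np.array([0.0, 0.0, height]))
    return K @ E

def project(C, p_world):
    p = C @ np.append(np.asarray(p_world, dtype=float), 1.0)
    return p[:2] / p[2]

def back_project(C, pixel, axis, value):
    """Recover the 3D point whose coordinate `axis` (0, 1 or 2) equals `value`."""
    free = [i for i in range(3) if i != axis]
    A = np.zeros((4, 3))            # maps (free_1, free_2, 1) to (x1, x2, x3, 1)
    A[free[0], 0] = 1.0
    A[free[1], 1] = 1.0
    A[axis, 2] = value
    A[3, 2] = 1.0
    q = np.linalg.solve(C @ A, np.array([pixel[0], pixel[1], 1.0]))
    world = np.empty(3)
    world[free] = q[:2] / q[2]      # rescale by the scaling factor
    world[axis] = value             # add the known coordinate back
    return world

C = camera_matrix(height=20, tilt_deg=80)
ground_point = np.array([12.0, 90.0, 0.0])   # fixed x3 = 0 (level ground)
wall_point = np.array([3.0, 100.0, 1.5])     # fixed x2 = 100
ground_back = back_project(C, project(C, ground_point), axis=2, value=0.0)
wall_back = back_project(C, project(C, wall_point), axis=1, value=100.0)
```

Using `np.linalg.solve` instead of forming the inverse explicitly gives the same result and is numerically somewhat more robust.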

Fitting Camera Parameters
Often, only the intrinsic camera parameters are known, but not the extrinsic parameters that define the orientation of the camera. The CameraTransform package provides several fitting routines that allow users to infer the extrinsic parameters from characteristic features in the image.
In many cases, the heading and position of the camera can be set to 0, as they are only of interest when the camera image needs to be compared to other camera images or when it needs to be cartographically mapped. This leaves only the parameters height, tilt and roll free, unless the camera was properly horizontally aligned, in which case roll is zero.

Influence of Camera Parameter Uncertainties
To evaluate the sensitivity of the perspective projection with respect to uncertainties in the camera parameters, we computationally place objects of 1 m height in world coordinates at different distances from the camera (50-300 m) and project them to the camera image. The positions in the camera image are then projected back to real-world coordinates using a different parameter set in which we vary the camera height and tilt angle. We use a focal length of 14 mm and a sensor size of 17.3×9.7 mm with 4608×2592 px. The camera is placed at a height of 20 m with a tilt angle of 80°. For the back projection, the height and tilt are varied by ±10% (Fig. 2). For each parameter configuration, the apparent object height is calculated. Since we know the true object height, the reconstructed object height indicates the error that is introduced by uncertainties in the extrinsic camera parameters. We find that the apparent object height is robust to variations in camera height regardless of the distance between object and camera (Fig. 2A). By contrast, the apparent object height is sensitive to variations in the camera's tilt angle, especially for objects at larger distances from the camera (Fig. 2B).
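The experiment can be approximated with a short simulation. This is a sketch under several assumptions that are not taken from the text: the camera convention (tilt 0° looks straight down, 90° looks at the horizon along +y), the way the apparent height is computed (the head pixel is back-projected onto the vertical plane through the reconstructed foot position), and a tilt perturbation of ±2° rather than ±10%, chosen so that every back-projected ray still intersects the ground in this simplified model. All function names are hypothetical.

```python
import numpy as np

IMAGE = (4608, 2592)
F_PIX = 14 / 17.3 * IMAGE[0]

def camera_matrix(height, tilt_deg):
    a = np.deg2rad(tilt_deg)
    K = np.array([[F_PIX, 0, IMAGE[0] / 2, 0],
                  [0, F_PIX, IMAGE[1] / 2, 0],
                  [0, 0, 1, 0]])
    R = np.array([[1, 0, 0],
                  [0, -np.cos(a), -np.sin(a)],
                  [0, np.sin(a), -np.cos(a)]])
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -(R @ np.array([0.0, 0.0, height]))
    return K @ E

def project(C, p_world):
    p = C @ np.append(p_world, 1.0)
    return p[:2] / p[2]

def back_project(C, pixel, axis, value):
    free = [i for i in range(3) if i != axis]
    A = np.zeros((4, 3))
    A[free[0], 0] = A[free[1], 1] = 1.0
    A[axis, 2] = value
    A[3, 2] = 1.0
    q = np.linalg.solve(C @ A, np.array([pixel[0], pixel[1], 1.0]))
    world = np.empty(3)
    world[free] = q[:2] / q[2]
    world[axis] = value
    return world

def apparent_height(C_true, C_assumed, distance, true_height=1.0):
    """Project with the true camera, reconstruct with the assumed one."""
    foot_px = project(C_true, [0.0, distance, 0.0])
    head_px = project(C_true, [0.0, distance, true_height])
    foot = back_project(C_assumed, foot_px, axis=2, value=0.0)      # foot on the ground
    head = back_project(C_assumed, head_px, axis=1, value=foot[1])  # head above the foot
    return head[2]

C_true = camera_matrix(20.0, 80.0)
distances = np.arange(50, 301, 50)
by_height = {k: [apparent_height(C_true, camera_matrix(20.0 * k, 80.0), d) for d in distances]
             for k in (0.9, 1.1)}
by_tilt = {t: [apparent_height(C_true, camera_matrix(20.0, t), d) for d in distances]
           for t in (78.0, 82.0)}
```

In this simplified model, a relative error in the camera height carries over to the apparent object height as the same relative error at every distance, whereas the error caused by a tilt offset grows with distance.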

Fitting extrinsic parameters from object of known height
If the true height of objects in the image is known, the camera parameters can be fitted. This works especially well for the tilt angle, as it most sensitively affects the apparent object height (Fig. 2B). The input for the fitting routine is a list of base (foot) and top (head) positions of the objects. For fitting, the algorithm projects the foot positions from the image to world coordinates, moves the base positions in z-direction by the known object height, and projects these points back to the camera image. The difference between the input top positions and the back-projected top positions is then minimized with a least-squares fit routine. Optionally, if a horizon is visible in the image, CameraTransform uses the horizon line as an additional constraint for fitting the camera parameters. The error between the user-selected horizon and the fitted horizon is assigned a weight of 50% of the total error.
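The cost function described above can be sketched as follows. This is not the CameraTransform implementation: the camera convention (tilt 0° looks straight down, 90° looks at the horizon along +y) is assumed, the horizon constraint is omitted, the data are noise-free synthetic points, and a brute-force grid search over height and tilt stands in for the least-squares optimizer. All function names are hypothetical.

```python
import numpy as np

IMAGE = (4608, 2592)
F_PIX = 14 / 17.3 * IMAGE[0]

def camera_matrix(height, tilt_deg):
    a = np.deg2rad(tilt_deg)
    K = np.array([[F_PIX, 0, IMAGE[0] / 2, 0],
                  [0, F_PIX, IMAGE[1] / 2, 0],
                  [0, 0, 1, 0]])
    R = np.array([[1, 0, 0],
                  [0, -np.cos(a), -np.sin(a)],
                  [0, np.sin(a), -np.cos(a)]])
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -(R @ np.array([0.0, 0.0, height]))
    return K @ E

def project(C, points):
    """World points (N, 3) to pixels (N, 2)."""
    points = np.atleast_2d(points)
    p = np.c_[points, np.ones(len(points))] @ C.T
    return p[:, :2] / p[:, 2:3]

def to_ground(C, pixels):
    """Pixels (N, 2) to world points (N, 3) on the plane z = 0."""
    A = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 0], [0, 0, 1.0]])
    pixels = np.atleast_2d(pixels)
    q = np.linalg.solve(C @ A, np.c_[pixels, np.ones(len(pixels))].T).T
    return np.c_[q[:, :2] / q[:, 2:3], np.zeros(len(q))]

def cost(height, tilt_deg, feet_px, heads_px, object_height):
    """Sum of squared pixel distances between marked and predicted tops."""
    C = camera_matrix(height, tilt_deg)
    tops = to_ground(C, feet_px)        # feet: image -> world
    tops[:, 2] += object_height         # move up by the known object height
    return np.sum((project(C, tops) - heads_px) ** 2)   # world -> image, compare

# Synthetic "clicked" data: 15 objects of 1 m height, 50 to 150 m away.
rng = np.random.default_rng(0)
feet = np.c_[rng.uniform(-20, 20, 15), np.linspace(50, 150, 15), np.zeros(15)]
heads = feet + [0.0, 0.0, 1.0]
C_true = camera_matrix(20.0, 80.0)
feet_px, heads_px = project(C_true, feet), project(C_true, heads)

# Brute-force search over candidate heights (m) and tilt angles (deg).
heights = np.arange(15.0, 25.01, 0.5)
tilts = np.arange(75.0, 85.01, 0.25)
costs = np.array([[cost(h, t, feet_px, heads_px, 1.0) for t in tilts] for h in heights])
i, j = np.unravel_index(np.argmin(costs), costs.shape)
fitted_height, fitted_tilt = heights[i], tilts[j]
```

With real, noisy annotations the minimum is less sharply defined, and a gradient-based or sampling-based optimizer would replace the grid search.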
To evaluate this method, an artificial image is created using the CameraTransform package. We again use a focal length of 14 mm, a sensor size of 17.3×9.7 mm with 4608×2592 px, a camera height of 20 m and a tilt angle of 80°. Fifteen rectangles with a width of 30 cm and a height of 1 m are placed at distances ranging from 50 to 150 m. Using the software ClickPoints, we mark the base and top positions of these rectangles and provide them as input for the fitting routine. We then investigate how the fitted height and the fitted tilt angle vary with the number of provided objects. We start with only one object and increase the number of objects to 15. For every iteration, the objects are randomly chosen. The experiment was repeated multiple times with and without a horizon.
The results indicate, as expected, that including a larger number of objects decreases the uncertainty of the parameter estimate (as indicated by the variability between repeated measurements) (Fig. 3). Both the camera height and the tilt angle can be fitted with considerably less uncertainty if a horizon is provided (Fig. 3D,E), compared to parameter estimates without a horizon (Fig. 3A,B). The reconstructed object heights (Fig. 3C,F) follow the same pattern and also profit from the horizon information.
To demonstrate the fitting procedure, we analyse an image (Fig. 4A) from a wide-angle camera overlooking an Emperor penguin colony at Pointe Géologie, Antarctica. The camera was positioned on a nunatak, but no height information was provided. We estimate the extrinsic camera parameters by analysing the feet and head positions of 20 animals, assuming an average penguin height of 1 m. Fig. 4B shows the projected top view after fitting the extrinsic camera parameters. The camera height obtained by the fit, 23.7 m, is close to the value of 25.7 m measured by differential GPS.

Fitting by geo-referencing
For large tilt angles, e.g. if images are taken from a helicopter (Fig. 5A), the size of the objects in the image does not vary sufficiently with the y position in the image, so that the fitting approach based on the known object size is not viable. In addition, the horizon is unlikely to be visible. For such images, a different method is needed. If the approximate x,y location of the camera is known and an accurate map or a satellite image is available, point correspondences between the image and the map can be used to estimate the camera parameters using a process known as image registration.
In the example shown in Fig. 5, where we photograph a King penguin colony at Baie du Marin from a helicopter flying approximately 300 m above ground, we use eight points that are recognizable in both the camera image and a satellite image provided by Google Earth (Fig. 5A,C). The cost function for our image registration is the distance between the projection of the image points to real-world coordinates and the corresponding points in the satellite image. The fit routine then computes the height and tilt of the camera as well as the xy-position and heading angle. The example in Fig. 5 demonstrates that the fit routine matches all points except point #7, which is the branch point of a river that has likely shifted since the satellite image was taken (Fig. 5B).
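A registration cost of this kind can be sketched as below. The sketch only evaluates the cost for given parameters and leaves the minimization to an optimizer of choice. It relies on assumptions not taken from the text: landmarks lie on the plane z = 0, the camera convention is the one stated in the code comments, the intrinsic parameters are borrowed from the synthetic example, and the landmark positions are synthetic. All names are hypothetical.

```python
import numpy as np

IMAGE = (4608, 2592)
F_PIX = 14 / 17.3 * IMAGE[0]   # intrinsics borrowed from the synthetic example

def camera_matrix(x, y, height, tilt_deg, heading_deg):
    # Assumed convention: tilt 0 looks straight down, 90 at the horizon;
    # heading is measured clockwise from +y when seen from above; no roll.
    a, b = np.deg2rad([tilt_deg, heading_deg])
    K = np.array([[F_PIX, 0, IMAGE[0] / 2, 0],
                  [0, F_PIX, IMAGE[1] / 2, 0],
                  [0, 0, 1, 0]])
    R_tilt = np.array([[1, 0, 0],
                       [0, -np.cos(a), -np.sin(a)],
                       [0, np.sin(a), -np.cos(a)]])
    R_heading = np.array([[np.cos(b), -np.sin(b), 0],
                          [np.sin(b), np.cos(b), 0],
                          [0, 0, 1]])
    R = R_tilt @ R_heading
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -(R @ np.array([x, y, height], dtype=float))
    return K @ E

def project(C, points):
    points = np.atleast_2d(points)
    p = np.c_[points, np.ones(len(points))] @ C.T
    return p[:, :2] / p[:, 2:3]

def to_ground(C, pixels):
    A = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 0], [0, 0, 1.0]])
    pixels = np.atleast_2d(pixels)
    q = np.linalg.solve(C @ A, np.c_[pixels, np.ones(len(pixels))].T).T
    return q[:, :2] / q[:, 2:3]

def registration_cost(params, pixels, map_xy):
    """Summed ground distance (m) between back-projected pixels and map points."""
    C = camera_matrix(*params)
    return np.sum(np.linalg.norm(to_ground(C, pixels) - map_xy, axis=1))

# Synthetic landmarks around the spot the camera is looking at.
true_params = (10.0, -20.0, 300.0, 40.0, 30.0)   # x, y, height, tilt, heading
rng = np.random.default_rng(1)
reach = 300.0 * np.tan(np.deg2rad(40.0))
centre = np.array([10.0 + reach * np.sin(np.deg2rad(30.0)),
                   -20.0 + reach * np.cos(np.deg2rad(30.0))])
map_xy = centre + rng.uniform(-80, 80, size=(8, 2))
pixels = project(camera_matrix(*true_params), np.c_[map_xy, np.zeros(8)])

cost_true = registration_cost(true_params, pixels, map_xy)
cost_wrong = registration_cost((10.0, -20.0, 330.0, 40.0, 30.0), pixels, map_xy)
```

The cost vanishes (up to rounding) at the true parameters and grows when any of the five parameters is changed, which is what a fit routine exploits.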

Summary
We present a Python package for estimating extrinsic camera parameters based on image features, for image geo-referencing, and for correcting perspective image distortions. The package is designed to assist in analysing images for ecological applications. The package is published under the GPLv3.

Figure 1 - Extrinsic camera parameters. Side view: the height specifies how high the camera is positioned over the ground, the tilt angle specifies how much the camera is tilted against the horizontal. Top view: the offset (x, y) specifies how much the camera is moved from the origin and the heading angle specifies in which direction it is looking. Image: the roll specifies how much the image is rotated around its centre.

Figure 2 - Influence of height and tilt angle variation of ±10%. Objects with a height of 1 m (dashed line) at different distances (50-300 m) are projected to the camera and back to the world with changed camera parameters. A) Variation of the height parameter: 20 m ± 10%. B) Variation of the tilt parameter: 80° ± 10%.

Figure 3 - Influence of the number of objects used for fitting. Top row A-C) without a given horizon, bottom row D-F) with a given horizon. A+D) The fitted camera height and B+E) the camera tilt angle for different numbers of used objects. For each number of objects, a random selection (without replacement) is taken from the clicked objects and the camera matrix is fitted (parameters: blue dots). From these, the mean is calculated (red crosses). C+F) The error on fitted object heights for different fits (mean ± std, blue error bars).

Figure 4 - Application to real data. A) Image of a penguin colony taken with the MicrObs system. The feet (green) and head (blue) positions of 20 penguins were manually marked. These data were used to fit the camera perspective (fitted heads: red crosses), which allows the image to be projected to a top view (B).