zero.concat
zero.concat(iterable)
Concatenate items (tensors, numpy-arrays, tuples, dicts etc.) along the first dimension.
concat is a more general version of torch.cat(..., dim=0). It works not only with sequences of tensors, but also with sequences of containers (tuples, dicts etc.) of different types of data (tensors, numpy-arrays, primitive types). See the tutorial and the examples below to understand what the function does.
- Parameters
iterable (Iterable[zero.data.T]) – items of the same structure (for example, “an iterable of tensors” OR “an iterable of tuples of tensors where all the tuples are of the same length” OR “an iterable of dicts of tensors and numpy-arrays where all the dicts have the same keys” etc.)
- Returns
Concatenated items of the iterable.
- Return type
zero.data.T
Note
The concatenation algorithm is fully determined by the first item of the iterable. If there are items of different structure, then the function is likely to fail or produce incorrect results, hence the requirement of the same structure for all items of the iterable.
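To make the note concrete, here is a minimal sketch of structure-driven concatenation. It is a toy, list-only illustration and not the implementation used by zero.concat: the name concat_sketch is hypothetical, tensors and arrays are left out, and raising on empty input is a choice made only for this sketch. What it shows is how inspecting only the first item selects the branch, and why items of a different structure can fail or silently produce wrong results.

```python
def concat_sketch(iterable):
    """Toy illustration for plain Python data; NOT zero.concat itself."""
    items = list(iterable)  # materialize the iterable first
    if not items:
        # A choice made for this sketch only.
        raise ValueError("expected a non-empty iterable")
    first = items[0]
    # The branch is selected by looking at the first item only.
    if isinstance(first, list):
        return [x for item in items for x in item]
    if isinstance(first, tuple):
        return tuple(
            concat_sketch(item[i] for item in items) for i in range(len(first))
        )
    if isinstance(first, dict):
        return {key: concat_sketch(item[key] for item in items) for key in first}
    # Anything else is returned as a list of the items.
    return items


same_structure = [{'l': [0, 1]}, {'l': [2, 3]}, {'l': [4, 5, 6]}]
merged = concat_sketch(same_structure)

# Different keys: the keys of the first item are looked up in every item.
try:
    concat_sketch([{'a': [0]}, {'b': [1]}])
    failed = False
except KeyError:
    failed = True

# A tuple after a list: no error, but the tuple is flattened as if it were a list.
silently_wrong = concat_sketch([[0, 1], (2, 3)])
```

In this sketch the mismatched dicts raise KeyError, while the list/tuple mix goes unnoticed, which is the kind of outcome the note warns about.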
Warning
The function starts by converting the iterable to a list. Make sure there is enough memory for this operation; otherwise, the memory limit may be exceeded. In most cases, a manual implementation would involve the same conversion, so simply keep this in mind when using the function.
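The following sketch illustrates why a manual implementation usually ends up with the same conversion. It uses plain lists only; step and batches are hypothetical stand-ins for a real model and data loader. Because each field of the per-batch output is gathered in a separate pass, the outputs have to be stored first.

```python
from itertools import chain


def step(batch):
    # Hypothetical per-batch function: returns (predictions, targets).
    xs, ys = batch
    return [2 * x for x in xs], ys


batches = [([0, 1], [0, 0]), ([2, 3], [1, 1]), ([4, 5, 6], [0, 1, 0])]

# map() yields each output only once, so the outputs are stored in a list
# before every field is gathered in its own pass.
outputs = list(map(step, batches))
y_pred = list(chain.from_iterable(out[0] for out in outputs))
y = list(chain.from_iterable(out[1] for out in outputs))
```

All per-batch outputs are alive at the same time in outputs, so the peak memory is at least the size of the full result.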
See also
Tutorial
For usage examples, scroll further.
If you have an iterable that contains/produces batches of some kind (tensors, numpy-arrays, tuples/dictionaries thereof and other not-too-specific content), then use concat to concatenate all the items. A prominent case is application of models and functions to batches (e.g. to DataLoader):

    whole_result = concat(map(model_or_fn, batches))
    # or
    whole_result = concat(expression(x) for x in batches)
For example:
    dataset = ...  # PyTorch dataset
    loader = DataLoader(dataset, batch_size)

    def step(batch):
        X, y = batch
        return model(X), y

    y_pred, y = concat(map(step, loader))
    assert len(y_pred) == len(dataset) and len(y) == len(dataset)

    # or
    def step(batch):
        X, y = batch
        return {'y_pred': model(X), 'y': y}

    result = concat(map(step, loader))  # no changes
    # compare lengths, not the tensor itself, with the dataset size
    assert len(result['y_pred']) == len(dataset) and len(result['y']) == len(dataset)
The function can be used in combination with iter_batches. For example, this is how pairwise dot products can be calculated in a batchwise manner if full matrix multiplication does not fit into memory:

    n_objects = 100
    n_features = 16
    batch_size = 20
    data = torch.randn(n_objects, n_features)
    result = concat(
        batch.matmul(data.T).to('cpu') for batch in iter_batches(data, batch_size)
    )
    assert result.shape == (n_objects, n_objects)
Or even like this:
    n_objects = 100
    n_features = 16
    batch_size = 20
    data = torch.randn(n_objects, n_features)
    result = concat(
        concat(b.matmul(a.T).to('cpu') for b in iter_batches(data, batch_size)).T
        for a in iter_batches(data, batch_size)
    )
    assert result.shape == (n_objects, n_objects)
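The idea behind the batchwise computation can also be sketched without tensors. The code below is only an illustration under stated assumptions: iter_batches_sketch and dot are hypothetical helpers written for this example (they are not the library's iter_batches), and nested lists stand in for a matrix. Each batch produces a block of rows of the final matrix, and concatenating the blocks along the first dimension reproduces the full result.

```python
def iter_batches_sketch(rows, batch_size):
    """Hypothetical stand-in: yield consecutive slices of a list."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def dot(u, v):
    # Dot product of two equally sized vectors.
    return sum(a * b for a, b in zip(u, v))


data = [[1, 0], [0, 1], [1, 1], [2, 3], [4, 5]]
batch_size = 2  # the last batch has a single row

result = []
for batch in iter_batches_sketch(data, batch_size):
    # Block of shape (len(batch), len(data)): products of the batch rows
    # with all rows of the data.
    block = [[dot(row, other) for other in data] for row in batch]
    result.extend(block)  # concatenation along the first dimension

# Reference: the full pairwise matrix computed in one go.
full = [[dot(row, other) for other in data] for row in data]
```

Only one block needs to be computed at a time, while the concatenated result still has to fit into memory.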
Examples
How to read the examples:
the mental model for understanding the following examples is “concatenating data for 3 batches of sizes (2, 2, 3)”. Note that sizes of batches are allowed to vary, but the structure is always the same
in all examples there is data - a list of batches; in fact, it can be any “iterable of batches”, including iterators and generators; the list is chosen to simplify the demonstration
1-D example:
    result = concat([
        torch.tensor([0, 1]),
        torch.tensor([2, 3]),
        torch.tensor([4, 5, 6])
    ])
    assert torch.equal(result, torch.tensor([0, 1, 2, 3, 4, 5, 6]))
2-D example:
    result = concat([
        torch.tensor([
            [0, 0],
            [1, 1]
        ]),
        torch.tensor([
            [2, 2],
            [3, 3]
        ]),
        torch.tensor([
            [4, 4],
            [5, 5],
            [6, 6],
        ]),
    ])
    assert torch.equal(
        result,
        torch.tensor([
            [0, 0],
            [1, 1],
            [2, 2],
            [3, 3],
            [4, 4],
            [5, 5],
            [6, 6]
        ])
    )
N-D example: <the same>.
The following examples demonstrate support for different kinds of input data; data is 1-D everywhere just for simplicity (i.e. dimensions can be arbitrary).
    array = np.array
    tensor = torch.tensor
    l = [0, 1, 2, 3, 4, 5, 6]
    a = array([0, 1, 2, 3, 4, 5, 6])
    t = tensor([0, 1, 2, 3, 4, 5, 6])

    data = [[0, 1], [2, 3], [4, 5, 6]]
    assert concat(data) == l

    data = [array([0, 1]), array([2, 3]), array([4, 5, 6])]
    assert np.array_equal(concat(data), a)

    data = [tensor([0, 1]), tensor([2, 3]), tensor([4, 5, 6])]
    assert torch.equal(concat(data), t)

    # If items are not lists, arrays nor tensors, the data is returned in a form
    # of a list. It makes sense since the list of such items is already
    # a result for all batches.
    data = ['three batches, hence three items', 0, 1.0]
    assert concat(data) == data

    data = [
        ([0, 1], array([0, 1]), tensor([0, 1])),
        ([2, 3], array([2, 3]), tensor([2, 3])),
        ([4, 5, 6], array([4, 5, 6]), tensor([4, 5, 6])),
    ]
    result = concat(data)
    assert isinstance(result, tuple) and len(result) == 3
    assert (
        result[0] == l
        and np.array_equal(result[1], a)
        and torch.equal(result[2], t)
    )

    data = [
        {'l': [0, 1], 'a': array([0, 1]), 't': tensor([0, 1])},
        {'l': [2, 3], 'a': array([2, 3]), 't': tensor([2, 3])},
        {'l': [4, 5, 6], 'a': array([4, 5, 6]), 't': tensor([4, 5, 6])},
    ]
    result = concat(data)
    assert isinstance(result, dict) and list(result) == ['l', 'a', 't']
    assert (
        result['l'] == l
        and np.array_equal(result['a'], a)
        and torch.equal(result['t'], t)
    )