Commit 0eb06a9

docs: update dam chapter and add uses_with, uses_metas, uses_requests to Flow.md (#3146)
1 parent 2b79c06 commit 0eb06a9

File tree

2 files changed

+315
-42
lines changed

.github/2.0/cookbooks/Document.md

Lines changed: 201 additions & 42 deletions
Original file line number | Diff line number | Diff line change
@@ -28,11 +28,15 @@ Table of Contents
2828
- [Set & Unset attributes](#set--unset-attributes)
2929
- [Access nested attributes from tags](#access-nested-attributes-from-tags)
3030
- [Construct `Document`](#construct-document)
31+
- [Content attributes](#content-attributes)
3132
- [Exclusivity of `doc.content`](#exclusivity-of-doccontent)
3233
- [Conversion between `doc.content`](#conversion-between-doccontent)
3334
- [Set embedding](#set-embedding)
35+
- [Support for sparse arrays](#support-for-sparse-arrays)
3436
- [Construct with multiple attributes](#construct-with-multiple-attributes)
37+
- [Meta attributes](#meta-attributes)
3538
- [Construct from dict or JSON string](#construct-from-dict-or-json-string)
39+
- [Parsing unrecognized fields](#parsing-unrecognized-fields)
3640
- [Construct from another `Document`](#construct-from-another-document)
3741
- [Construct from JSON, CSV, `ndarray` and files](#construct-from-json-csv-ndarray-and-files)
3842
- [Construct Recursive `Document`](#construct-recursive-document)
@@ -63,12 +67,13 @@ Table of Contents
6367
- [`DocumentArrayMemmap` API](#documentarraymemmap-api)
6468
- [Create `DocumentArrayMemmap`](#create-documentarraymemmap)
6569
- [Add Documents to `DocumentArrayMemmap`](#add-documents-to-documentarraymemmap)
70+
- [Buffer pool](#buffer-pool)
71+
- [Modifying elements of `DocumentArrayMemmap`](#modifying-elements-of-documentarraymemmap)
6672
- [Clear a `DocumentArrayMemmap`](#clear-a-documentarraymemmap)
6773
- [Pruning](#pruning)
68-
- [Mutable sequence with "read-only" elements](#mutable-sequence-with-read-only-elements)
6974
- [Side-by-side vs. `DocumentArray`](#side-by-side-vs-documentarray)
7075
- [Convert between `DocumentArray` and `DocumentArrayMemmap`](#convert-between-documentarray-and-documentarraymemmap)
71-
- [Maintaining consistency via `.reload()`](#maintaining-consistency-via-reload)
76+
- [Maintaining consistency via `.reload()` and `.save()`](#maintaining-consistency-via-reload-and-save)
7277

7378
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
7479

@@ -1265,9 +1270,10 @@ da.visualize()
12651270

12661271
When your `DocumentArray` object contains a large number of `Document` objects, holding it in memory can be very demanding. You
12671272
may want to use `DocumentArrayMemmap` to alleviate this issue. A `DocumentArrayMemmap` stores all Documents directly on
1268-
the disk, while only keeps a small lookup table in memory. This lookup table contains the offset and length of
1269-
each `Document`, hence it is much smaller than the full `DocumentArray`. Elements are loaded on-demand to memory during
1270-
the access.
1273+
the disk, while keeping a small lookup table in memory and a fixed-size buffer pool of documents. The lookup
1274+
table contains the offset and length of each `Document`, hence it is much smaller than the full `DocumentArray`.
1275+
Elements are loaded into memory on demand when they are accessed. Loaded documents are kept in the buffer pool so that
1276+
they can be modified.
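
To make the lookup-table idea concrete, here is a minimal sketch of an append-only store that keeps only `(offset, length)` pairs in memory and reads each record from disk on access. The class `TinyOffsetStore` and the file name `records.bin` are hypothetical and written only for illustration; this is not how `DocumentArrayMemmap` is implemented, and it stores raw bytes rather than `Document` objects.

```python
import os
import tempfile


class TinyOffsetStore:
    """Toy append-only store: payloads live on disk, only (offset, length) stays in memory."""

    def __init__(self, path):
        self.path = path
        self.index = []  # one (offset, length) tuple per record
        open(path, 'ab').close()  # make sure the file exists

    def append(self, payload: bytes) -> None:
        with open(self.path, 'ab') as f:
            offset = f.seek(0, os.SEEK_END)  # current end of file
            f.write(payload)
        self.index.append((offset, len(payload)))

    def __getitem__(self, i: int) -> bytes:
        # Load a single record on demand using its offset and length.
        offset, length = self.index[i]
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return f.read(length)

    def __len__(self) -> int:
        return len(self.index)


path = os.path.join(tempfile.mkdtemp(), 'records.bin')
store = TinyOffsetStore(path)
for text in ('hello', 'world', 'goodbye'):
    store.append(text.encode('utf-8'))

second = store[1]  # only this record is read from disk
```

The in-memory index grows by two integers per record regardless of how large each payload is, which is why it stays much smaller than the data itself.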
12711277

12721278
The following table shows the speed and memory consumption when writing and reading 50,000 `Document` objects.
12731279

@@ -1299,22 +1305,32 @@ dam = DocumentArrayMemmap('./my-memmap')
12991305
dam.extend([d1, d2])
13001306
```
13011307

1302-
The `dam` object stores all future Documents into `./my-memmap`, there is no need to manually call `save`/`load`. In
1303-
fact, `save`/`load` methods are not available in `DocumentArrayMemmap`.
1308+
The `dam` object stores all future Documents into `./my-memmap`, so there is no need to manually call `save`/`reload`.
1309+
Recently added, modified or accessed documents are also kept in the memory buffer, so all changes to documents are
1310+
applied first in the memory buffer and persisted to disk lazily (e.g. when they are evicted from the buffer pool or when
1311+
the `dam` object's destructor is called). If you want to persist the changed documents immediately, you can call `save`.
13041312

1305-
### Clear a `DocumentArrayMemmap`
13061313

1307-
To clear all contents in a `DocumentArrayMemmap` object, simply call `.clear()`. It will clean all content on disk.
1314+
### Buffer pool
1315+
A fixed number of documents is kept in the memory buffer pool. The number can be configured with the constructor
1316+
parameter `buffer_pool_size` (1000 by default). Only the `buffer_pool_size` most recently accessed, modified or added
1317+
documents exist in the pool. When the pool is full, the least recently used (LRU) document is replaced.
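
The LRU replacement policy can be sketched with the standard library's `OrderedDict`. The `LRUPool` class below is a hypothetical illustration of the policy only; it is not the buffer pool used by `DocumentArrayMemmap`, and its method names are assumptions.

```python
from collections import OrderedDict


class LRUPool:
    """Toy fixed-size pool that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def touch(self, key, value):
        """Add or access an entry; return the evicted key, or None."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
        self._items[key] = value
        if len(self._items) > self.capacity:
            evicted_key, _ = self._items.popitem(last=False)
            return evicted_key
        return None

    def keys(self):
        return list(self._items)


pool = LRUPool(capacity=3)
evicted = [pool.touch(k, k.upper()) for k in ('a', 'b', 'c')]
pool.touch('a', 'A')                  # 'a' becomes the most recently used
evicted.append(pool.touch('d', 'D'))  # pool is full, so 'b' is evicted
```

After these calls the pool holds `c`, `a` and `d`, in that order from least to most recently used.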
13081318

1309-
#### Pruning
1319+
```python
1320+
from jina.types.arrays.memmap import DocumentArrayMemmap
1321+
from jina import Document
1322+
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1323+
dam.extend([Document() for _ in range(100)])
1324+
```
13101325

1311-
One may notice another method `.prune()` that shares similar semantics. `.prune()` method is designed for "
1312-
post-optimizing" the on-disk data structure of `DocumentArrayMemmap` object. It can reduce the on-disk usage.
1326+
The buffer pool ensures that documents modified in memory are persisted to disk. Therefore, you should not keep your own
1327+
references to documents and modify them if they might be outside of the buffer pool. The next section explains the best
1328+
practices when modifying documents.
13131329

1314-
### Mutable sequence with "read-only" elements
1330+
### Modifying elements of `DocumentArrayMemmap`
13151331

1316-
The biggest caveat in `DocumentArrayMemmap` is that you can **not** modify element's attribute inplace. Though
1317-
the `DocumentArrayMemmap` is mutable, each of its element is not. For example:
1332+
Modifying elements of a `DocumentArrayMemmap` is possible because accessed and modified documents are kept
1333+
in the buffer pool:
13181334

13191335
```python
13201336
from jina.types.arrays.memmap import DocumentArrayMemmap
@@ -1332,33 +1348,151 @@ print(dam[0].text)
13321348
```
13331349

13341350
```text
1335-
hello
1336-
```
1337-
1338-
One can see the `text` field has not changed!
1339-
1340-
To update an existing `Document` in a `DocumentArrayMemmap`, you need to assign it to a new `Document` object.
1351+
goodbye
1352+
```
1353+
1354+
However, some practices should be **avoided**. Mainly, do not modify documents through your own references when those documents
1355+
might no longer be in the buffer pool. For example:
1356+
1357+
1. Keep more references than the buffer pool size and modify them:
1358+
<table>
1359+
<tr>
1360+
<td>
1361+
<b><center>❌ Don't</center></b>
1362+
</td>
1363+
<td>
1364+
<b><center>✅ Do</center></b>
1365+
</td>
1366+
</tr>
1367+
<tr>
1368+
<td>
1369+
1370+
```python
1371+
from jina import Document
1372+
from jina.types.arrays.memmap import DocumentArrayMemmap
1373+
1374+
docs = [Document(text='hello') for _ in range(100)]
1375+
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1376+
dam.extend(docs)
1377+
for doc in docs:
1378+
    doc.text = 'goodbye'
1379+
1380+
dam[50].text
1381+
```
1382+
```text
1383+
hello
1384+
```
1385+
1386+
1387+
</td>
1388+
<td>
1389+
Use the dam object to modify instead:
1390+
1391+
```python
1392+
from jina import Document
1393+
from jina.types.arrays.memmap import DocumentArrayMemmap
1394+
1395+
docs = [Document(text='hello') for _ in range(100)]
1396+
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1397+
dam.extend(docs)
1398+
for doc in dam:
1399+
    doc.text = 'goodbye'
1400+
1401+
dam[50].text
1402+
```
1403+
```text
1404+
goodbye
1405+
```
1406+
1407+
It's also okay if you reference fewer docs than the buffer pool size:
1408+
1409+
```python
1410+
from jina import Document
1411+
from jina.types.arrays.memmap import DocumentArrayMemmap
1412+
1413+
docs = [Document(text='hello') for _ in range(100)]
1414+
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=1000)
1415+
dam.extend(docs)
1416+
for doc in docs:
1417+
    doc.text = 'goodbye'
1418+
1419+
dam[50].text
1420+
```
1421+
```text
1422+
goodbye
1423+
```
1424+
1425+
</td>
1426+
</tr>
1427+
</table>
1428+
1429+
1430+
2. Modify a reference that might have left the buffer pool:
1431+
<table>
1432+
<tr>
1433+
<td>
1434+
<b><center>❌ Don't</center></b>
1435+
</td>
1436+
<td>
1437+
<b><center>✅ Do</center></b>
1438+
</td>
1439+
</tr>
1440+
<tr>
1441+
<td>
1442+
1443+
```python
1444+
from jina import Document
1445+
from jina.types.arrays.memmap import DocumentArrayMemmap
1446+
1447+
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1448+
my_doc = Document(text='hello')
1449+
dam.append(my_doc)
1450+
1451+
# my_doc leaves the buffer pool after extend
1452+
dam.extend([Document(text='hello') for _ in range(99)])
1453+
my_doc.text = 'goodbye'
1454+
dam[0].text
1455+
```
1456+
```text
1457+
hello
1458+
```
1459+
1460+
1461+
</td>
1462+
<td>
1463+
Get the document from the dam object and then modify it:
1464+
1465+
```python
1466+
from jina import Document
1467+
from jina.types.arrays.memmap import DocumentArrayMemmap
1468+
1469+
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1470+
my_doc = Document(text='hello')
1471+
dam.append(my_doc)
1472+
1473+
# my_doc leaves the buffer pool after extend
1474+
dam.extend([Document(text='hello') for _ in range(99)])
1475+
dam[my_doc.id].text = 'goodbye' # or dam[0].text = 'goodbye'
1476+
dam[0].text
1477+
```
1478+
```text
1479+
goodbye
1480+
```
1481+
1482+
</td>
1483+
</tr>
1484+
</table>
1485+
1486+
To summarize, it's a best practice to **rely on the `dam` object to reference the docs that you modify**.
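
Why a stale reference loses its update can be seen in a small model of a write-back pool. The sketch below is an assumption-laden toy: `ToyWriteBackArray` is hypothetical, plain dicts stand in for `Document` objects, and a Python list of JSON strings stands in for the on-disk file. It only mirrors the behaviour described above (persist on eviction, reload on access), not the actual `DocumentArrayMemmap` internals.

```python
import json
from collections import OrderedDict


class ToyWriteBackArray:
    """Toy array: live objects in a small LRU pool, serialized copies on 'disk'."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.disk = []             # stands in for the on-disk file (JSON strings)
        self.pool = OrderedDict()  # index -> live object

    def _admit(self, idx, doc):
        self.pool[idx] = doc
        self.pool.move_to_end(idx)
        while len(self.pool) > self.pool_size:
            old_idx, old_doc = self.pool.popitem(last=False)
            self.disk[old_idx] = json.dumps(old_doc)  # persist on eviction

    def append(self, doc):
        self.disk.append(json.dumps(doc))
        self._admit(len(self.disk) - 1, doc)

    def __getitem__(self, idx):
        if idx in self.pool:
            self.pool.move_to_end(idx)
            return self.pool[idx]
        doc = json.loads(self.disk[idx])  # fresh object loaded from 'disk'
        self._admit(idx, doc)
        return doc


arr = ToyWriteBackArray(pool_size=2)
my_doc = {'text': 'hello'}
arr.append(my_doc)
arr.append({'text': 'x'})
arr.append({'text': 'y'})    # evicts index 0, persisting 'hello'

my_doc['text'] = 'goodbye'   # stale reference: the array no longer tracks this object
stale_result = arr[0]['text']

arr[0]['text'] = 'goodbye'   # fetched through the array, so the pool tracks it
fresh_result = arr[0]['text']
```

In this model the first read returns `hello` because the change was made to an object that had already been evicted and serialized, while the second read returns `goodbye` because the object was obtained from the array itself.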
13411487

1342-
```python
1343-
from jina.types.arrays.memmap import DocumentArrayMemmap
1344-
from jina import Document
1345-
1346-
d1 = Document(text='hello')
1347-
d2 = Document(text='world')
1348-
1349-
dam = DocumentArrayMemmap('./my-memmap')
1350-
dam.extend([d1, d2])
1488+
### Clear a `DocumentArrayMemmap`
13511489

1352-
dam[0] = Document(text='goodbye')
1490+
To clear all contents in a `DocumentArrayMemmap` object, simply call `.clear()`. It will clean all content on disk.
13531491

1354-
for d in dam:
1355-
    print(d)
1356-
```
1492+
#### Pruning
13571493

1358-
```text
1359-
{'id': '44a74b56-c821-11eb-8522-1e008a366d48', 'mime_type': 'text/plain', 'text': 'goodbye'}
1360-
{'id': '44a73562-c821-11eb-8522-1e008a366d48', 'mime_type': 'text/plain', 'text': 'world'}
1361-
```
1494+
One may notice another method `.prune()` that shares similar semantics. `.prune()` method is designed for "
1495+
post-optimizing" the on-disk data structure of `DocumentArrayMemmap` object. It can reduce the on-disk usage.
13621496

13631497
### Side-by-side vs. `DocumentArray`
13641498

@@ -1407,12 +1541,12 @@ dam.extend(da)
14071541
da = DocumentArray(dam)
14081542
```
14091543

1410-
### Maintaining consistency via `.reload()`
1544+
### Maintaining consistency via `.reload()` and `.save()`
14111545

14121546
Consider two `DocumentArrayMemmap` objects that share the same on-disk storage `./memmap` but sit in different
1413-
processes/threads. After some writing ops, the consistency of the lookup table may be corrupted, as
1414-
each `DocumentArrayMemmap` object has its own version of lookup table in memory. `.reload()` method is for solving this
1415-
issue:
1547+
processes/threads. After some write operations, the lookup table and the buffer pool may become inconsistent, as
1548+
each `DocumentArrayMemmap` object has its own version of the lookup table and buffer pool in memory. `.reload()` and
1549+
`.save()` solve this issue:
14161550

14171551
```python
14181552
from jina.types.arrays.memmap import DocumentArrayMemmap
@@ -1438,3 +1572,28 @@ assert len(dam2) == 2
14381572
dam2.reload()
14391573
assert len(dam2) == 0
14401574
```
1575+
You don't need to call `.save()` after adding new documents. However, if you modified an attribute of a document, you need
1576+
to call it so that other objects can see the change after `.reload()`:
1577+
1578+
```python
1579+
from jina.types.arrays.memmap import DocumentArrayMemmap
1580+
from jina import Document
1581+
1582+
d1 = Document(text='hello')
1583+
1584+
dam = DocumentArrayMemmap('./my-memmap')
1585+
dam2 = DocumentArrayMemmap('./my-memmap')
1586+
1587+
dam.append(d1)
1588+
d1.text = 'goodbye'
1589+
assert len(dam) == 1
1590+
assert len(dam2) == 0
1591+
1592+
dam2.reload()
1593+
assert len(dam2) == 1
1594+
assert dam2[0].text == 'hello'
1595+
1596+
dam.save()
1597+
dam2.reload()
1598+
assert dam2[0].text == 'goodbye'
1599+
```
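
The interplay of the two calls can be modelled without Jina. In the hedged sketch below, `ToySharedArray` is a hypothetical class in which two handles share one JSON file; appends are written immediately, while in-place changes stay in the handle's memory until `save()` is called. It illustrates the semantics described above under those assumptions and is not the library's implementation.

```python
import json
import os
import tempfile


class ToySharedArray:
    """Toy handle over a shared JSON file, with its own in-memory copy."""

    def __init__(self, path):
        self.path = path
        if not os.path.exists(path):
            with open(path, 'w') as f:
                json.dump([], f)
        self.reload()

    def reload(self):
        # Replace the in-memory view with what is currently on disk.
        with open(self.path) as f:
            self.docs = json.load(f)

    def save(self):
        # Flush in-memory modifications to disk.
        with open(self.path, 'w') as f:
            json.dump(self.docs, f)

    def append(self, doc):
        # New items are persisted right away, as they are at append time.
        self.docs.append(doc)
        with open(self.path) as f:
            on_disk = json.load(f)
        on_disk.append(doc)
        with open(self.path, 'w') as f:
            json.dump(on_disk, f)


path = os.path.join(tempfile.mkdtemp(), 'shared.json')
a = ToySharedArray(path)
b = ToySharedArray(path)

d1 = {'text': 'hello'}
a.append(d1)
d1['text'] = 'goodbye'   # in-memory change, not yet on disk

b.reload()
text_before_save = b.docs[0]['text']

a.save()
b.reload()
text_after_save = b.docs[0]['text']
```

Here `b` sees the appended item after its first `reload()`, but the modified text only becomes visible once `a` has called `save()` and `b` has reloaded again.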
