@@ -28,11 +28,15 @@ Table of Contents
28
28
- [Set & Unset attributes](#set--unset-attributes)
29
29
- [Access nested attributes from tags](#access-nested-attributes-from-tags)
30
30
- [Construct `Document`](#construct-document)
31
+ - [Content attributes](#content-attributes)
31
32
- [Exclusivity of `doc.content`](#exclusivity-of-doccontent)
32
33
- [Conversion between `doc.content`](#conversion-between-doccontent)
33
34
- [Set embedding](#set-embedding)
35
+ - [Support for sparse arrays](#support-for-sparse-arrays)
34
36
- [Construct with multiple attributes](#construct-with-multiple-attributes)
37
+ - [Meta attributes](#meta-attributes)
35
38
- [Construct from dict or JSON string](#construct-from-dict-or-json-string)
39
+ - [Parsing unrecognized fields](#parsing-unrecognized-fields)
36
40
- [Construct from another `Document`](#construct-from-another-document)
37
41
- [Construct from JSON, CSV, `ndarray` and files](#construct-from-json-csv-ndarray-and-files)
38
42
- [Construct Recursive `Document`](#construct-recursive-document)
@@ -63,12 +67,13 @@ Table of Contents
63
67
- [`DocumentArrayMemmap` API](#documentarraymemmap-api)
64
68
- [Create `DocumentArrayMemmap`](#create-documentarraymemmap)
65
69
- [Add Documents to `DocumentArrayMemmap`](#add-documents-to-documentarraymemmap)
70
+ - [Buffer pool](#buffer-pool)
71
+ - [Modifying elements of `DocumentArrayMemmap`](#modifying-elements-of-documentarraymemmap)
66
72
- [Clear a `DocumentArrayMemmap`](#clear-a-documentarraymemmap)
67
73
- [Pruning](#pruning)
68
- - [Mutable sequence with "read-only" elements](#mutable-sequence-with-read-only-elements)
69
74
- [Side-by-side vs. `DocumentArray`](#side-by-side-vs-documentarray)
70
75
- [Convert between `DocumentArray` and `DocumentArrayMemmap`](#convert-between-documentarray-and-documentarraymemmap)
71
- - [Maintaining consistency via `.reload()`](#maintaining-consistency-via-reload)
76
+ - [Maintaining consistency via `.reload()` and `.save()`](#maintaining-consistency-via-reload-and-save)
72
77
73
78
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
74
79
@@ -1265,9 +1270,10 @@ da.visualize()
1265
1270
1266
1271
When your `DocumentArray` object contains a large number of `Document` objects, holding it in memory can be very demanding. You
1267
1272
may want to use `DocumentArrayMemmap` to alleviate this issue. A `DocumentArrayMemmap` stores all Documents directly on
1268
- the disk, while only keeps a small lookup table in memory. This lookup table contains the offset and length of
1269
- each ` Document ` , hence it is much smaller than the full ` DocumentArray ` . Elements are loaded on-demand to memory during
1270
- the access.
1273
+ the disk, while keeping only a small lookup table and a fixed-size buffer pool of documents in memory. The lookup
1274
+ table contains the offset and length of each `Document`, so it is much smaller than the full `DocumentArray`.
1275
+ Elements are loaded into memory on demand when they are accessed. Loaded documents are kept in the buffer pool so that
1276
+ they can be modified.
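
To make the lookup-table idea concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration under stated assumptions, not the way `DocumentArrayMemmap` is implemented: the class `TinyDiskStore`, its methods, and the JSON serialization are hypothetical choices made for this example. Records are appended to a single file, and only an `(offset, length)` pair per key is kept in memory.

```python
import json
import os
import tempfile


class TinyDiskStore:
    """Hypothetical toy store: payloads live on disk, only offsets live in memory."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length) of the serialized record
        open(self.path, 'ab').close()  # make sure the file exists

    def append(self, key, record):
        data = json.dumps(record).encode('utf-8')
        with open(self.path, 'ab') as f:
            f.seek(0, os.SEEK_END)  # records are always added at the end
            offset = f.tell()
            f.write(data)
        self.index[key] = (offset, len(data))

    def __getitem__(self, key):
        # Read exactly one record from disk, located via the lookup table.
        offset, length = self.index[key]
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return json.loads(f.read(length).decode('utf-8'))

    def __len__(self):
        return len(self.index)


store_path = os.path.join(tempfile.mkdtemp(), 'records.bin')
store = TinyDiskStore(store_path)
store.append('d1', {'text': 'hello'})
store.append('d2', {'text': 'world'})
loaded = store['d2']  # only this record is read from disk
```

The in-memory footprint of this toy grows with the number of keys rather than with the size of the payloads, which is the trade-off the paragraph above describes.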
1271
1277
1272
1278
The following table shows the speed and memory consumption when writing and reading 50,000 `Document` objects.
1273
1279
@@ -1299,22 +1305,32 @@ dam = DocumentArrayMemmap('./my-memmap')
1299
1305
dam.extend([d1, d2])
1300
1306
```
1301
1307
1302
- The ` dam ` object stores all future Documents into ` ./my-memmap ` , there is no need to manually call ` save ` /` load ` . In
1303
- fact, ` save ` /` load ` methods are not available in ` DocumentArrayMemmap ` .
1308
+ The `dam` object stores all future Documents into `./my-memmap`; there is no need to manually call `save`/`reload`.
1309
+ Recently added, modified, or accessed documents are also kept in the memory buffer, so all changes to documents are
1310
+ applied first in the memory buffer and persisted to disk lazily (e.g., when they leave the buffer pool or when
1311
+ the `dam` object's destructor is called). If you want to persist the changed documents immediately, call `save`.
1304
1312
1305
- ### Clear a ` DocumentArrayMemmap `
1306
1313
1307
- To clear all contents in a ` DocumentArrayMemmap ` object, simply call ` .clear() ` . It will clean all content on disk.
1314
+ ### Buffer pool
1315
+ A fixed number of documents are kept in the in-memory buffer pool. The number can be configured with the constructor
1316
+ parameter `buffer_pool_size` (1000 by default). Only the `buffer_pool_size` most recently accessed, modified, or added
1317
+ documents stay in the pool. When the pool is full, documents are replaced using a least-recently-used (LRU) strategy.
1308
1318
1309
- #### Pruning
1319
+ ```python
1320
+ from jina.types.arrays.memmap import DocumentArrayMemmap
1321
+ from jina import Document
1322
+ dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)  # keep at most 10 documents in memory
1323
+ dam.extend([Document() for _ in range(100)])
1324
+ ```
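
The LRU replacement described above can be sketched with the standard library alone. This is a hedged illustration, not the pool used inside `DocumentArrayMemmap`: the class `LRUBufferPool`, its `touch` and `flush` methods, and the `on_evict` callback are hypothetical names, and a plain dict stands in for the on-disk storage.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Hypothetical fixed-size pool that writes an entry back when it is evicted."""

    def __init__(self, size, on_evict):
        self.size = size
        self.on_evict = on_evict  # called as on_evict(key, value) for evicted entries
        self._items = OrderedDict()  # ordered from least to most recently used

    def touch(self, key, value):
        """Add an entry, or mark an existing one as most recently used."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.size:
            # Evict the least recently used entry and hand it to the write-back hook.
            old_key, old_value = self._items.popitem(last=False)
            self.on_evict(old_key, old_value)

    def __contains__(self, key):
        return key in self._items

    def flush(self):
        """Write back everything still in the pool (similar in spirit to `save`)."""
        for key, value in self._items.items():
            self.on_evict(key, value)


persisted = {}  # stands in for the on-disk storage
pool = LRUBufferPool(size=2, on_evict=lambda k, v: persisted.update({k: v}))
pool.touch('a', 1)
pool.touch('b', 2)
pool.touch('a', 1)  # 'a' becomes the most recently used entry
pool.touch('c', 3)  # the pool is full, so 'b' (least recently used) is evicted
```

In this sketch an entry reaches the stand-in storage only when it is evicted or when `flush` is called, which mirrors the lazy persistence described in the previous section.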
1310
1325
1311
- One may notice another method ` .prune() ` that shares similar semantics. ` .prune() ` method is designed for "
1312
- post-optimizing" the on-disk data structure of ` DocumentArrayMemmap ` object. It can reduce the on-disk usage.
1326
+ The buffer pool ensures that documents modified in memory are persisted to disk. Therefore, you should not modify
1327
+ documents through references you hold yourself if they might be outside of the buffer pool. The next section explains
1328
+ the best practices when modifying documents.
1313
1329
1314
- ### Mutable sequence with "read-only" elements
1330
+ ### Modifying elements of `DocumentArrayMemmap`
1315
1331
1316
- The biggest caveat in ` DocumentArrayMemmap ` is that you can ** not ** modify element's attribute inplace. Though
1317
- the ` DocumentArrayMemmap ` is mutable, each of its element is not. For example :
1332
+ Modifying elements of a `DocumentArrayMemmap` is possible because accessed and modified documents are kept
1333
+ in the buffer pool:
1318
1334
1319
1335
```python
1320
1336
from jina.types.arrays.memmap import DocumentArrayMemmap
@@ -1332,33 +1348,151 @@ print(dam[0].text)
1332
1348
```
1333
1349
1334
1350
```text
1335
- hello
1336
- ```
1337
-
1338
- One can see the ` text ` field has not changed!
1339
-
1340
- To update an existing ` Document ` in a ` DocumentArrayMemmap ` , you need to assign it to a new ` Document ` object.
1351
+ goodbye
1352
+ ```
1353
+
1354
+ However, you should not modify documents through references that you hold yourself if the documents
1355
+ might no longer be in the buffer pool. Here are some practices to **avoid**:
1356
+
1357
+ 1. Keeping references to more documents than the buffer pool size and modifying them:
1358
+ <table >
1359
+ <tr >
1360
+ <td >
1361
+ <b ><center >❌ Don't</center ></b >
1362
+ </td >
1363
+ <td >
1364
+ <b ><center >✅ Do</center ></b >
1365
+ </td >
1366
+ </tr >
1367
+ <tr >
1368
+ <td >
1369
+
1370
+ ```python
1371
+ from jina import Document
1372
+ from jina.types.arrays.memmap import DocumentArrayMemmap
1373
+
1374
+ docs = [Document(text='hello') for _ in range(100)]
1375
+ dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1376
+ dam.extend(docs)
1377
+ for doc in docs:
1378
+     doc.text = 'goodbye'
1379
+
1380
+ dam[50].text
1381
+ ```
1382
+ ```text
1383
+ hello
1384
+ ```
1385
+
1386
+
1387
+ </td>
1388
+ <td>
1389
+ Use the `dam` object to modify the documents instead:
1390
+
1391
+ ```python
1392
+ from jina import Document
1393
+ from jina.types.arrays.memmap import DocumentArrayMemmap
1394
+
1395
+ docs = [Document(text='hello') for _ in range(100)]
1396
+ dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1397
+ dam.extend(docs)
1398
+ for doc in dam:
1399
+     doc.text = 'goodbye'
1400
+
1401
+ dam[50].text
1402
+ ```
1403
+ ```text
1404
+ goodbye
1405
+ ```
1406
+
1407
+ It is also fine to modify your own references when you hold fewer documents than the buffer pool size:
1408
+
1409
+ ```python
1410
+ from jina import Document
1411
+ from jina.types.arrays.memmap import DocumentArrayMemmap
1412
+
1413
+ docs = [Document(text='hello') for _ in range(100)]
1414
+ dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=1000)
1415
+ dam.extend(docs)
1416
+ for doc in docs:
1417
+     doc.text = 'goodbye'
1418
+
1419
+ dam[50].text
1420
+ ```
1421
+ ```text
1422
+ goodbye
1423
+ ```
1424
+
1425
+ </td >
1426
+ </tr >
1427
+ </table >
1428
+
1429
+
1430
+ 2. Modifying a reference to a document that might have left the buffer pool:
1431
+ <table >
1432
+ <tr >
1433
+ <td >
1434
+ <b ><center >❌ Don't</center ></b >
1435
+ </td >
1436
+ <td >
1437
+ <b ><center >✅ Do</center ></b >
1438
+ </td >
1439
+ </tr >
1440
+ <tr >
1441
+ <td >
1442
+
1443
+ ```python
1444
+ from jina import Document
1445
+ from jina.types.arrays.memmap import DocumentArrayMemmap
1446
+
1447
+ dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1448
+ my_doc = Document(text='hello')
1449
+ dam.append(my_doc)
1450
+
1451
+ # my_doc leaves the buffer pool after extend
1452
+ dam.extend([Document(text='hello') for _ in range(99)])
1453
+ my_doc.text = 'goodbye'
1454
+ dam[0].text
1455
+ ```
1456
+ ```text
1457
+ hello
1458
+ ```
1459
+
1460
+
1461
+ </td>
1462
+ <td>
1463
+ Get the document from the `dam` object and then modify it:
1464
+
1465
+ ```python
1466
+ from jina import Document
1467
+ from jina.types.arrays.memmap import DocumentArrayMemmap
1468
+
1469
+ dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
1470
+ my_doc = Document(text='hello')
1471
+ dam.append(my_doc)
1472
+
1473
+ # my_doc leaves the buffer pool after extend
1474
+ dam.extend([Document(text='hello') for _ in range(99)])
1475
+ dam[my_doc.id].text = 'goodbye' # or dam[0].text = 'goodbye'
1476
+ dam[0].text
1477
+ ```
1478
+ ```text
1479
+ goodbye
1480
+ ```
1481
+
1482
+ </td >
1483
+ </tr >
1484
+ </table >
1485
+
1486
+ To summarize, it is best practice to **rely on the `dam` object to reference the docs that you modify**.
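
The stale-reference pitfall can be reproduced without Jina. The sketch below is a toy model under explicit assumptions: `ToyBufferedStore` is a hypothetical class, a dict plays the role of the disk, and `copy.deepcopy` stands in for serialization. It only illustrates why a change made through an old reference can be lost once the object has left the pool.

```python
import copy
from collections import OrderedDict


class ToyBufferedStore:
    """Hypothetical stand-in: a dict plays the disk, an OrderedDict the buffer pool."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.disk = {}             # index -> snapshot taken when the object was written
        self.pool = OrderedDict()  # index -> live object that is still tracked

    def append(self, obj):
        idx = len(self.disk)
        self.disk[idx] = copy.deepcopy(obj)
        self._track(idx, obj)

    def _track(self, idx, obj):
        self.pool[idx] = obj
        self.pool.move_to_end(idx)
        if len(self.pool) > self.pool_size:
            old_idx, old_obj = self.pool.popitem(last=False)
            self.disk[old_idx] = copy.deepcopy(old_obj)  # write back on eviction

    def __getitem__(self, idx):
        if idx not in self.pool:
            self._track(idx, copy.deepcopy(self.disk[idx]))  # load a fresh object
        else:
            self.pool.move_to_end(idx)
        return self.pool[idx]


store = ToyBufferedStore(pool_size=2)
first = {'text': 'hello'}
store.append(first)
store.append({'text': 'hello'})
store.append({'text': 'hello'})  # `first` is evicted from the pool here

first['text'] = 'goodbye'        # stale reference: the store no longer tracks it
stale_result = store[0]['text']

store[0]['text'] = 'goodbye'     # modify through the store instead
store.append({'text': 'hello'})  # force more evictions
store.append({'text': 'hello'})
fresh_result = store[0]['text']
```

In this toy, `stale_result` still reflects the old text, while `fresh_result` reflects the change, because only the object obtained from the store was tracked and written back on eviction.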
1341
1487
1342
- ``` python
1343
- from jina.types.arrays.memmap import DocumentArrayMemmap
1344
- from jina import Document
1345
-
1346
- d1 = Document(text = ' hello' )
1347
- d2 = Document(text = ' world' )
1348
-
1349
- dam = DocumentArrayMemmap(' ./my-memmap' )
1350
- dam.extend([d1, d2])
1488
+ ### Clear a `DocumentArrayMemmap`
1351
1489
1352
- dam[ 0 ] = Document( text = ' goodbye ' )
1490
+ To clear all contents in a `DocumentArrayMemmap` object, simply call `.clear()`. This also removes all content on disk.
1353
1491
1354
- for d in dam:
1355
- print (d)
1356
- ```
1492
+ #### Pruning
1357
1493
1358
- ``` text
1359
- {'id': '44a74b56-c821-11eb-8522-1e008a366d48', 'mime_type': 'text/plain', 'text': 'goodbye'}
1360
- {'id': '44a73562-c821-11eb-8522-1e008a366d48', 'mime_type': 'text/plain', 'text': 'world'}
1361
- ```
1494
+ You may notice another method, `.prune()`, with similar semantics. `.prune()` is designed for "post-optimizing" the
1495
+ on-disk data structure of a `DocumentArrayMemmap` object and can reduce its on-disk usage.
1362
1496
1363
1497
### Side-by-side vs. `DocumentArray`
1364
1498
@@ -1407,12 +1541,12 @@ dam.extend(da)
1407
1541
da = DocumentArray(dam)
1408
1542
```
1409
1543
1410
- ### Maintaining consistency via ` .reload() `
1544
+ ### Maintaining consistency via `.reload()` and `.save()`
1411
1545
1412
1546
Consider two `DocumentArrayMemmap` objects that share the same on-disk storage `./memmap` but sit in different
1413
- processes/threads. After some writing ops, the consistency of the lookup table may be corrupted, as
1414
- each ` DocumentArrayMemmap ` object has its own version of lookup table in memory. ` .reload() ` method is for solving this
1415
- issue:
1547
+ processes/threads. After some write operations, the lookup table and the buffer pool may become inconsistent, as
1548
+ each `DocumentArrayMemmap` object has its own version of the lookup table and buffer pool in memory. `.reload()` and
1549
+ `.save()` are for solving this issue:
1416
1550
1417
1551
```python
1418
1552
from jina.types.arrays.memmap import DocumentArrayMemmap
@@ -1438,3 +1572,28 @@ assert len(dam2) == 2
1438
1572
dam2.reload()
1439
1573
assert len(dam2) == 0
1440
1574
```
1575
+ You don't need to call `.save()` after adding new documents. However, if you have modified an attribute of a document, you need
1576
+ to call it:
1577
+
1578
+ ```python
1579
+ from jina.types.arrays.memmap import DocumentArrayMemmap
1580
+ from jina import Document
1581
+
1582
+ d1 = Document(text='hello')
1583
+
1584
+ dam = DocumentArrayMemmap('./my-memmap')
1585
+ dam2 = DocumentArrayMemmap('./my-memmap')
1586
+
1587
+ dam.append(d1)
1588
+ d1.text = 'goodbye'  # the change stays in dam's buffer pool for now
1589
+ assert len(dam) == 1
1590
+ assert len(dam2) == 0
1591
+
1592
+ dam2.reload()
1593
+ assert len(dam2) == 1
1594
+ assert dam2[0].text == 'hello'
1595
+
1596
+ dam.save()  # persist the modified document to disk
1597
+ dam2.reload()
1598
+ assert dam2[0].text == 'goodbye'
1599
+ ```
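
The same consistency problem can be sketched without Jina. The example below is a simplified model, not the library's mechanism: `SharedView` is a hypothetical class that keeps its own in-memory copy of a shared JSON file. Unlike `DocumentArrayMemmap`, which persists appended documents on its own, this toy requires an explicit `save` for every change.

```python
import json
import os
import tempfile


class SharedView:
    """Hypothetical view over a shared JSON file, with its own in-memory copy."""

    def __init__(self, path):
        self.path = path
        self.items = []
        self.reload()

    def reload(self):
        """Discard the in-memory copy and re-read the shared file."""
        if os.path.exists(self.path):
            with open(self.path, 'r', encoding='utf-8') as f:
                self.items = json.load(f)
        else:
            self.items = []

    def save(self):
        """Write the in-memory copy to the shared file."""
        with open(self.path, 'w', encoding='utf-8') as f:
            json.dump(self.items, f)


shared_path = os.path.join(tempfile.mkdtemp(), 'shared.json')
view_a = SharedView(shared_path)
view_b = SharedView(shared_path)

view_a.items.append({'text': 'hello'})
view_a.save()
before_reload = len(view_b.items)  # view_b still holds its old copy
view_b.reload()
after_reload = len(view_b.items)   # view_b now sees what view_a saved
```

Each view only learns about the other's changes when the writer saves and the reader reloads, which is the pattern the `dam`/`dam2` example above follows.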