diff doc/libervia-cli/pubsub_cache.rst @ 3715:b9718216a1c0 0.9

merge bookmark 0.9
author Goffi <goffi@goffi.org>
date Wed, 01 Dec 2021 16:13:31 +0100
parents d0b66efc6c0e
children
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/libervia-cli/pubsub_cache.rst	Wed Dec 01 16:13:31 2021 +0100
@@ -0,0 +1,350 @@
+.. _libervia-cli_pubsub_cache:
+
+=====================================
+pubsub/cache: PubSub Cache Management
+=====================================
+
+Libervia transparently maintains a cache for pubsub: depending on internal criteria, some
+pubsub items are stored locally.
+
+The ``cache`` subcommands let users inspect and manipulate the internal cache.
+
+get
+===
+
+Retrieve items from the internal cache only. Most end users won't need this command, as
+the usual ``pubsub get`` command uses the cache transparently. However, it may be useful
+for inspecting the local cache, notably for debugging.
+
+The parameters are basically the same as for :ref:`li_pubsub_get`.
+
+example
+-------
+
+Retrieve the last 2 cached items for personal blog::
+
+    $ li pubsub cache get -n urn:xmpp:microblog:0 -M 2
+
+.. _li_pubsub_cache_sync:
+
+sync
+====
+
+Synchronise or resynchronise a pubsub node. If the node is already in cache, it will be
+deleted then re-cached. The node will be put in cache even if internal policy doesn't
+request a synchronisation for this kind of node. The node will be (re-)subscribed to keep
+the cache synchronised.
+
+All items of the node (up to the internal limit, which is high) will be retrieved and put
+in cache, even if previous versions of those items have been deleted by the
+:ref:`li_pubsub_cache_purge` command.
+
+
+example
+-------
+
+Resynchronise personal blog::
+
+    $ li pubsub cache sync -n urn:xmpp:microblog:0
+
+.. _li_pubsub_cache_purge:
+
+purge
+=====
+
+Remove items from cache. This may be desirable to save resources, notably disk space.
+
+Note that once a pubsub node is cached, the cache is the source of truth. That means that
+if the cache is not explicitly bypassed when retrieving items of a pubsub node (notably
+with the ``-C, --no-cache`` option of :ref:`li_pubsub_get`), only items found in cache
+will be returned. Purged items thus won't be used or returned anymore, even if they still
+exist on the original pubsub service.
+
+If you have purged items by mistake, it is possible to retrieve them either node by node
+using :ref:`li_pubsub_cache_sync`, or by resetting the whole pubsub cache with
+:ref:`li_pubsub_cache_reset`.
+
+If you have a node or a profile (e.g. a component) caching a lot of items frequently, you
+may run this command periodically from a scheduler like cron_.
+
+.. _cron: https://en.wikipedia.org/wiki/Cron
+
+examples
+--------
+
+Remove all blog and event items from cache if they haven't been updated in the last 6
+months::
+
+    $ li pubsub cache purge -t blog -t event -b "6 months ago"
+
+Remove items from profile ``ap_gateway`` if they have been created more than 2 months
+ago::
+
+    $ li pubsub cache purge -p ap_gateway --created-before "2 months ago"
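+
+As a sketch of the scheduler approach mentioned above, a crontab entry could run this
+last purge every Sunday at 03:00. This assumes that ``li`` is in the ``PATH`` used by
+cron and that the Libervia backend is reachable from the cron session; the schedule
+itself is only an illustration::
+
+    0 3 * * 0 li pubsub cache purge -p ap_gateway --created-before "2 months ago"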
+
+.. _li_pubsub_cache_reset:
+
+reset
+=====
+
+Reset the whole pubsub cache. This means that all nodes and all their items will be
+removed from cache. After this command, the cache will be refilled progressively as if it
+were a new one.
+
+.. note::
+
+    Use this command with caution: even if the cache is rebuilt over time, items will
+    have to be retrieved again, which may be resource intensive both for your machine and
+    for the pubsub services involved. Searching items will also return fewer results until
+    all desired items are cached again.
+
+    Also note that all items of cached nodes are retrieved: items that you have previously
+    purged will be retrieved again.
+
+example
+-------
+
+Reset the whole pubsub cache::
+
+    $ li pubsub cache reset
+
+search
+======
+
+Search items in the pubsub cache. The search is done on the whole cache; it's not
+restricted to a single node/profile (though it can be if suitable filters are specified).
+Full-Text Search can be done with the ``-f FTS, --fts FTS`` argument, as well as filtering
+on parsed data (with ``-F PATH OPERATOR VALUE, --field PATH OPERATOR VALUE``, see below).
+
+By default, parsed data are returned, with the 3 additional keys ``pubsub_service``,
+``pubsub_node`` (the search being done on the whole cache, those data are here to get the
+full location of each item) and ``node_profile``.
+
+"Parsed data" are the result of the parsing of the item's XML payload by feature-aware
+plugins. Those data are usually more readable and easier to work with. Parsed data are
+only stored when a parser is registered for a specific feature. A pubsub item in cache
+may therefore have no parsed data at all, in which case an empty dict is used instead
+(and the ``-P, --payload`` argument should be used to get the content of the item).
+
+The dates are normally stored as `Unix time`_ in database, but the default output converts
+the ``updated``, ``created`` and ``published`` fields to human-readable local time. Use
+``--output simple`` if you want to keep the float (or int) value.
+
+XML item payload is not returned by default, but it can be added to the ``item_payload``
+field if the ``-P, --payload`` argument is set. You can also use ``--output xml`` (or
+``xml_raw`` if you don't want prettifying) to directly output the highlighted XML
+(without the parsed data), to have an output similar to the one of ``li pubsub get``.
+
+If you are interested only in specific data (e.g. item id and title), the ``-k KEY,
+--key KEY`` argument can be used.
+
+You'll probably want to limit result size by using ``-l LIMIT, --limit LIMIT``, and do
+pagination using ``-i INDEX, --index INDEX``.
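+
+For instance, the following sketch combines those options to show only the id and title
+of 10 blog items cached for a hypothetical profile ``louise``, starting at index 10
+(assuming that ``INDEX`` is the position of the first returned item)::
+
+  $ li pubsub cache search -t blog -p louise -k id -k title -l 10 -i 10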
+
+.. _Unix time: https://en.wikipedia.org/wiki/Unix_time
+
+Filters
+-------
+
+By default, search returns all items in cache; you have to use filters to specify what
+you are looking for. We can split filters in 3 categories: nodes/items metadata,
+Full-Text Search query and parsed metadata.
+
+Nodes/items metadata are the generic information you have on a node: which profile it
+belongs to, which pubsub service it's coming from, what the name or type of the node is,
+etc.
+
+Arguments there should be self-explanatory. Type (set with ``-t TYPE, --type TYPE``) and
+subtype (set with ``-S SUBTYPE, --subtype SUBTYPE``) are values dependent on the
+plugin/feature associated with the node, so we can't list them in an exhaustive way here.
+The most common type is probably ``blog``, for which a subtype can be ``comment``. An
+empty string can be used to find items with (sub)type not set.
+
+It's usually a good idea to specify a profile with ``-p PROFILE, --profile PROFILE``,
+otherwise you may get duplicated results.
+
+Full-Text Search
+----------------
+
+You can specify a Full-Text Search query with the ``-f FTS_QUERY, --fts FTS_QUERY``
+argument. The engine is currently SQLite FTS5, and you can check its `query syntax`_.
+FTS is done on the whole raw XML payload, that means that all data there can be matched
+(including XML tags and attributes).
+
+FTS queries are indexed, which means that they are fast and efficient.
+
+.. note::
+
+  Future versions of Libervia will probably include other FTS engines (support for
+  PostgreSQL and MySQL/MariaDB is planned). Thus the syntax may vary depending on the
+  engine, or a common syntax may be implemented for all engines in the future. Keep that
+  in mind if you plan to use FTS capabilities in long-term queries, e.g. in scripts.
+
+.. _query syntax: https://sqlite.org/fts5.html#full_text_query_syntax
+
+Parsed Metadata Filters
+-----------------------
+
+It is possible to filter on any field of parsed data. This is done with the ``-F PATH
+OPERATOR VALUE, --field PATH OPERATOR VALUE`` (be careful that the short option is an
+uppercase ``F``, the lower case one being used for Full-Text Search).
+
+.. note::
+
+  Parsed Metadata Filters are not indexed, that means that using them is less efficient
+  than using e.g. Full-Text Search. If you want to filter on a text field, it's often a
+  good idea to pre-filter using Full-Text Search to have more efficient queries.
+
+``PATH`` and ``VALUE`` can be either specified as string, or using JSON syntax (if the
+value can't be decoded as JSON, it is used as plain text).
+
+``PATH`` is the name of the field to use. If you must go beyond root level fields, you can
+use a JSON array to specify each element of the path. If a string is used, it's an object
+key, if a number is used it's an array index. Thus you can use ``title`` to access the
+root title key, or ``'"title"'`` (JSON string escaped for shell) or ``'["title"]'`` (JSON
+array with the "title" string, escaped for shell).
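+
+As an illustration of a path going beyond root level, the following sketch matches items
+whose first tag is ``xmpp``. It assumes that ``tags`` is an array field of the parsed
+data, as in the examples at the end of this section, and uses a hypothetical profile
+``louise``::
+
+  $ li pubsub cache search -p louise -F '["tags", 0]' eq xmpp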
+
+.. note::
+
+  The extra fields ``pubsub_service``, ``pubsub_node`` and ``node_profile`` are added to
+  the result after the query, thus they can't be used as fields for filtering (use the
+  direct arguments for that).
+
+``OPERATOR`` indicates how to use the value to make a filter. The currently supported
+operators are:
+
+``==`` or ``eq``
+  Equality operator, true if field value is the same as given value.
+
+``!=`` or ``ne``
+  Inequality operator, true if the field value is different from given value.
+
+``>`` or ``gt``
+  Greater than, true if the field value is higher than given value. For string, this is
+  according to alphabetical order.
+
+  Time Pattern can be used here, see below.
+
+``<`` or ``lt``
+  Less than, true if the field value is lower than given value. For string, this is
+  according to alphabetical order.
+
+  Time Pattern can be used here, see below.
+
+``between``
+  Given value must be an array with 2 elements. The condition is true if field value is
+  between the 2 elements (for string, this is according to alphabetical order).
+
+  Time Pattern can be used here, see below.
+
+``in``
+  Given value must be an array of elements. Field value must be one of them to make the
+  condition true.
+
+``not_in``
+  Given value must be an array of elements. Field value must not be any of them to make
+  the condition true.
+
+``overlap``
+  This can be used only on array fields.
+
+  If given value is not already an array, it is put in an array. Condition is true if any
+  element of field value matches any element of given value. Notably useful to filter on
+  tags.
+
+``ioverlap``
+  Same as ``overlap`` but done in a case insensitive way.
+
+``disjoint``
+  This can be used only on array fields.
+
+  If given value is not already an array, it is put in an array. Condition is true if no
+  element of field value matches any element of given value. Notably useful to filter out
+  tags.
+
+``idisjoint``
+  Same as ``disjoint`` but done in a case insensitive way.
+
+``like``
+  Does pattern matching on a string. ``%`` can be used to match zero or more characters
+  and ``_`` can be used to match any single character.
+
+  If you're not looking for a specific field, it's better to use Full-Text Search when
+  possible.
+
+``ilike``
+  Like ``like`` but done in a case insensitive way.
+
+
+``not_like``
+  Same as ``like`` except that condition is true when pattern is **not** matching.
+
+``not_ilike``
+  Same as ``not_like`` but done in a case insensitive way.
+
+
+For ``gt``/``>``, ``lt``/``<`` and ``between``, you can use :ref:`time_pattern` by using
+the syntax ``TP(<time pattern>)`` (see examples below).
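+
+When using the symbolic forms in a shell, remember that ``>`` and ``<`` are redirection
+operators and must be quoted. As a sketch, assuming a POSIX-like shell, the two following
+commands should be equivalent::
+
+  $ li pubsub cache search -t blog -F published '>' 'TP(1 month ago)'
+  $ li pubsub cache search -t blog -F published gt 'TP(1 month ago)'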
+
+Ordering
+--------
+
+Result ordering can be done by a well-known order, or using a parsed data field. Ordering
+defaults to ``creation`` (see below), but this may be changed with ``-o ORDER [FIELD]
+[DIRECTION], --order-by ORDER [FIELD] [DIRECTION]``.
+
+``ORDER`` can be one of the following:
+
+``creation``
+  Order by item creation date. Note that this is the date of creation of the item in cache
+  (which most of the time should correspond to the order of creation of the item in the
+  source pubsub service), and this may differ from the date of publication as specified by
+  some features (like blog). This is important when old items are imported, e.g. when
+  they're coming from another blog engine.
+
+``modification``
+  Order by the date when item has last been modified. Modification date is the same as
+  creation date if the item has never been modified since it is in cache. The same warning
+  as for ``creation`` applies: this is the date of last modification in cache, not the one
+  advertised in parsed data.
+
+``item_id``
+  Order by XMPP id of the item. Notably useful when user-friendly IDs are used (as is
+  often the case with blogs).
+
+``rank``
+  Order items by Full-Text Search rank. This one can only be used when Full-Text Search is
+  used (via ``-f FTS_QUERY, --fts FTS_QUERY``). Rank is a value indicating how well an
+  item matches the query. This usually needs to be used with ``desc`` direction, so you
+  get the most relevant items first.
+
+``field``
+  This special order indicates that the ordering must be done on a parsed data field. The
+  following argument is then the path of the field to use (which can be a plain text name
+  of a root field, or a JSON encoded array). An optional direction can be specified as a
+  third argument. See examples below.
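+
+As a sketch based on the synopsis above (assuming that ``FIELD`` is only needed for the
+``field`` order, so that the direction directly follows ``rank``), ordering Full-Text
+Search results by relevance could look like this::
+
+  $ li pubsub cache search -f Slovakia -o rank desc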
+
+examples
+--------
+
+Search for blog items cached for the profile ``louise`` which contain the word
+``Slovakia``::
+
+  $ li pubsub cache search -t blog -p louise -f Slovakia
+
+Show title, publication date and id of blog articles (excluding comments) which have been
+published on Louise's blog during the last 6 months, order them by item id. Here we use an
+empty string as a subtype to exclude comments (for which subtype is ``comment``)::
+
+  $ li pubsub cache search -t blog -S "" -p louise -s louise@example.net -n urn:xmpp:microblog:0 -F published gt 'TP(6 months ago)' -k id -k published -k title -o item_id
+
+Show all blog items from anywhere which are tagged as XMPP or ActivityPub (case
+insensitive) and which have been published in the last month (according to advertised
+publishing date, not cache creation date).
+
+We want to order them by descending publication date (again the advertised publication
+date, not cache creation), and we don't want more than 50 results.
+
+We do a FTS query there even if it's not mandatory, because it will do an efficient
+pre-filtering::
+
+  $ li pubsub cache search -f "xmpp OR activitypub" -F tags ioverlap '["xmpp", "activitypub"]' -F published gt 'TP(1 month ago)' -o field published desc -l 50