This helper class uses Solr’s cursorMark (Solr 4.9+) to re-index collections or to dump a collection to the filesystem. It is useful if you want an offline snapshot of your data. You also need to re-index your data to upgrade Lucene indexes, and this is a handy way to do it.
In its most basic form, it sorts your items by id, pages through the results in batches of rows, and concurrently sends the data to the destination. The destination can be either another Solr collection or an IndexQ instance. If it is another Solr collection, you have to make sure that it is configured exactly like the source. Keep in mind that items added or modified while this operation is running may not be captured, so it is advised to stop indexing while it runs.
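The paging loop described above can be sketched roughly as follows. This is a minimal, self-contained illustration of cursor-style paging, not the Reindexer's actual implementation: `fake_solr_query` is a hypothetical in-memory stand-in for a Solr request sorted by id with a cursorMark, and `iter_batches` is a hypothetical helper name.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Doc = Dict[str, str]
Query = Callable[[str, int], Tuple[List[Doc], str]]


def fake_solr_query(docs: List[Doc], cursor_mark: str, rows: int) -> Tuple[List[Doc], str]:
    """Hypothetical stand-in for a Solr query sorted by id with a cursorMark.

    Returns one page of documents and the cursor to send with the next request.
    """
    ordered = sorted(docs, key=lambda d: d["id"])
    if cursor_mark != "*":  # "*" is the cursor value used for the first page
        ordered = [d for d in ordered if d["id"] > cursor_mark]
    page = ordered[:rows]
    # When nothing is left, hand back the same cursor that was sent.
    next_mark = page[-1]["id"] if page else cursor_mark
    return page, next_mark


def iter_batches(query: Query, rows: int) -> Iterator[List[Doc]]:
    """Yield batches of documents until the cursor stops advancing."""
    cursor = "*"
    while True:
        page, next_cursor = query(cursor, rows)
        if page:
            yield page
        if next_cursor == cursor:  # same cursor back means the results are exhausted
            break
        cursor = next_cursor


# Seven documents paged in batches of three.
docs = [{"id": "doc%02d" % i} for i in range(7)]
batches = list(iter_batches(lambda c, r: fake_solr_query(docs, c, r), rows=3))
```

Each batch could then be handed to a separate worker for sending to the destination, which is where the concurrency mentioned above would come in.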
If you are keeping the document’s index timestamp, with something like:
<field name="last_update" type="date" indexed="true" stored="true" default="NOW" />
You can specify that field through the date_field parameter. If it is supplied, the Reindexer will include date_field in the sort and start re-indexing with the oldest documents. This way new items will also be picked up. Note that deletions will not be carried over, so it is still advised to stop indexing.
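As a rough illustration of how the sort changes when date_field is supplied, the hypothetical helper below builds a Solr sort string. The helper name and the `id` unique key are assumptions made for this sketch and may not match how the Reindexer builds its queries. The one firm constraint is that Solr cursors require the unique key field in the sort as a tie-breaker.

```python
from typing import Optional


def build_sort(date_field: Optional[str] = None, id_field: str = "id") -> str:
    """Build a sort string: oldest documents first when a date field is given."""
    clauses = []
    if date_field:
        clauses.append("%s asc" % date_field)
    # The unique key goes last so documents with equal dates keep a stable order.
    clauses.append("%s asc" % id_field)
    return ", ".join(clauses)
```

With no date field this yields `id asc`; with `last_update` it yields `last_update asc, id asc`, so documents indexed during the run sort to the end and are reached last.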
Using date_field also allows you to resume the reindexing through the resume method if it gets interrupted for some reason.
On resume, it runs several range facet queries to compare the counts by date range and only re-processes the ranges that have missing documents.
Reindexer(source, dest, source_coll=None, dest_coll=None, rows=1000, date_field=None, devel=False, per_shard=False, ignore_fields=['_version_'])
Initiates the re-indexer.
- source – An instance of SolrClient.
- dest – An instance of SolrClient or an instance of IndexQ.
- source_coll (string) – Source collection name.
- dest_coll (string) – Destination collection name; only required if destination is SolrClient.
- rows (int) – Number of items to get in each query; default is 1000, however you will probably want to increase it.
- date_field (string) – String name of a Solr date field to use in sort and resume.
- devel (bool) – Whether to turn on very verbose logging for development. Standard DEBUG should suffice for most development.
- per_shard (bool) – Will add distrib=false to each query to get the data. Use this only if you will be running multiple instances of this to get the rest of the shards.
- ignore_fields (list) – Fields to exclude from Solr queries. This is important because if these fields are pulled out with the documents, you won’t be able to index the documents into the destination.
By default, it will try to determine and exclude copy fields as well as _version_. Pass in your own list to override or set it to False to prevent it from doing anything.
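Putting the parameters above together, a call might look like the sketch below. It assumes the package exposes `Reindexer` at the top level (`from SolrClient import Reindexer`); `build_reindexer` is a hypothetical wrapper, and the rows value and the `last_update` field name are illustrative choices only.

```python
def build_reindexer(source, dest, source_coll, dest_coll, date_field="last_update"):
    """Create a Reindexer that copies source_coll into dest_coll.

    source and dest are expected to be already-constructed SolrClient instances.
    """
    # Assumption: Reindexer is importable from the package root. The import is
    # kept inside the function so it only runs when the helper is called.
    from SolrClient import Reindexer

    return Reindexer(
        source=source,
        dest=dest,
        source_coll=source_coll,
        dest_coll=dest_coll,
        rows=5000,              # larger than the default of 1000
        date_field=date_field,  # enables date-ordered paging and resume
    )
```

ignore_fields is left at its default here, so copy fields and _version_ would be excluded as described above.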
Starts the re-indexing process. All arguments are passed down to the getter function.
- fq (string) – Filter query to pass to the source Solr to retrieve items. This can be used to limit the results.
resume(start_date=None, end_date=None, timespan='DAY', check=False)
This method may help if the original run was interrupted for some reason. It will only work under the following conditions:
- You have a date field that you can facet on.
- Indexing was stopped for the duration of the copy.
The way this tries to resume re-indexing is by running a date range facet on the source and destination collections. It then compares the counts in both collections for each timespan specified. If the counts are different, it will re-index items for each range where the counts are off. You can also pass in a start_date to only get items after a certain time. Note that each date range will be indexed in its entirety, even if there is only one item missing.
Keep in mind that this only checks the counts and not the actual data, so make sure the indexes weren’t modified between the reindexing run and the resume operation.
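The count comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes the facet counts for each collection have already been collected into a dict keyed by the start of each day, and `ranges_to_reindex` and `day_filter_query` are hypothetical helper names.

```python
from typing import Dict, List


def ranges_to_reindex(source_counts: Dict[str, int], dest_counts: Dict[str, int]) -> List[str]:
    """Return the day buckets whose document counts differ between collections."""
    return [
        day
        for day in sorted(source_counts)
        if source_counts[day] != dest_counts.get(day, 0)
    ]


def day_filter_query(date_field: str, day_start: str) -> str:
    """Build a Solr filter query covering one whole day, excluding the end bound."""
    return "%s:[%s TO %s+1DAY}" % (date_field, day_start, day_start)


# Example counts per day, shaped like the result of a date range facet.
source = {"2016-01-01T00:00:00Z": 10, "2016-01-02T00:00:00Z": 5, "2016-01-03T00:00:00Z": 7}
dest = {"2016-01-01T00:00:00Z": 10, "2016-01-02T00:00:00Z": 4}
stale_days = ranges_to_reindex(source, dest)
filters = [day_filter_query("last_update", day) for day in stale_days]
```

Here the second day is one document short and the third is missing entirely, so both days would be re-indexed in full. Because only counts are compared, a day whose documents changed without changing the count would go unnoticed.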
Parameters:
- start_date – Date to start indexing from. If not specified, there are no restrictions and all data is processed. Note that this value is passed to Solr directly and not modified.
- end_date – The date to index items up to.
- timespan – Solr Date Math compliant value for faceting; currently only DAY is supported.
- check – If set to True, it only logs differences between the two collections without actually modifying the destination.
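One cautious way to use these parameters is to run resume with check=True first and only then let it modify the destination. The sketch below relies only on the signature documented above; `careful_resume` is a hypothetical wrapper, and `reindexer` is assumed to be an existing Reindexer created with a date_field.

```python
def careful_resume(reindexer, start_date=None, end_date=None, apply_changes=False):
    """Log count differences first; re-index mismatched ranges only when asked."""
    # Dry run: only logs the differences between the two collections.
    reindexer.resume(start_date=start_date, end_date=end_date, check=True)
    if apply_changes:
        # Real run: re-indexes every day whose counts differ.
        reindexer.resume(start_date=start_date, end_date=end_date, timespan="DAY")
```

Reviewing the logged differences from the dry run before calling it again with apply_changes=True gives a chance to confirm that the mismatched ranges are the expected ones.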