Storage pools rebalancing PG counts


Storage pools rebalancing PG counts

rdowell
Someone on our QA team recently noticed that in a Ceph cluster that had been imported into VSM, pools created outside of VSM (manually created with the Ceph CLI) were having their PG and PGP counts altered unexpectedly.  In our install scenario we can't have this happen, as our PM has stated that we don't want PG counts on storage pools to be altered without user input.

After some digging I found that it was occurring in this function:

source/vsm/vsm/agent/manager.py:AgentManager:update_pg_and_pgp()

    def update_pg_and_pgp(self, context):
        LOG.info("Updating pg_num and pg_placement_num for pools")
        db_pools = db.pool_get_all(context)
        for pool in db_pools:
            # storage_group = db.storage_group_get_by_rule_id(context, \
            #                                         pool['crush_ruleset'])
            # if storage_group:
            crushmap = self.get_crushmap_json_format()
            osd_num_per_group = crushmap.osd_count_by_rule_id( pool['crush_ruleset'])
            #reset pgs
            max_pg_num_per_osd = pool['max_pg_num_per_osd']
            if not max_pg_num_per_osd:
                settings = db.vsm_settings_get_all(context)
                for setting in settings:
                    if setting['name'] == 'pg_count_factor':
                         max_pg_num_per_osd = int(setting['value'])
            auto_growth_pg = pool['auto_growth_pg']
            if auto_growth_pg:
                max_pg_num_finally = auto_growth_pg
            else:
                size = pool['size']
                if pool['size'] == 0:
                    pool_default_size = db.vsm_settings_get_by_name(context,'osd_pool_default_size')
                    size = int(pool_default_size.value)
                max_pg_num_finally = max_pg_num_per_osd * osd_num_per_group//size
            if max_pg_num_finally > pool['pg_num']:
                pg_num = max_pg_num_finally#self._compute_pg_num(context, osd_num_per_group, pool['size'])
                LOG.info("pool['crush_ruleset'] id %s has %s osds" % (pool['crush_ruleset'], osd_num_per_group))
                if osd_num_per_group > 64:
                    osd_num_per_group = 64
                    LOG.info("osd_num_per_group > 64, use osd_num_per_group=64")
                step_max_pg_num = osd_num_per_group * 32
                max_pg_num = step_max_pg_num + pool['pg_num']
                if pg_num > max_pg_num_finally:
                    pgp_num = pg_num = max_pg_num_finally
                    self.set_pool_pg_pgp_num(context, pool['name'], pg_num, pgp_num)
                elif pg_num > max_pg_num:
                    pgp_num = pg_num = max_pg_num
                    self.set_pool_pg_pgp_num(context, pool['name'], pg_num, pgp_num)
                elif pg_num > pool['pg_num']:
                    pgp_num = pg_num
                    self.set_pool_pg_pgp_num(context, pool['name'], pg_num, pgp_num)
                else:
                    continue
                ...
                ...
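
To make the numbers concrete, here is the arithmetic this function applies when auto_growth_pg is 0 (the values below are made up for illustration, not taken from our cluster):

    # Illustrative arithmetic only -- hypothetical values, not VSM code.
    max_pg_num_per_osd = 100   # pg_count_factor / max_pg_num_per_osd (100 is the default used in create())
    osd_num_per_group = 30     # OSDs reachable through the pool's CRUSH rule
    size = 3                   # pool replica count

    max_pg_num_finally = max_pg_num_per_osd * osd_num_per_group // size  # 1000

    # A pool created manually with 512 PGs has pool['pg_num'] = 512; since
    # 1000 > 512, the agent raises pg_num/pgp_num to 1000 (subject to the
    # per-step cap) without any user input -- which is what our QA team saw.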

I tried modifying our local copy of VSM to disable this function, but then found a new issue: with the function disabled, storage pools created in the VSM GUI were always created with a PG count of 64, regardless of what the user selected in the pool creation dialog.  After more digging I found the cause here:

 source/vsm/vsm/api/v1/storage_pool.py:StoragePoolController:create()

    def create(self, req, body=None):
        """Create a storage pool."""
        LOG.info(body)
        ...
        ...
        is_ec_pool = pool_dict.get('ecProfileId')
        if is_ec_pool:
            #erasure code pool 
            body_info = {'name': name,
                        'cluster_id':cluster_id,
                        'storage_group_id':storage_group_id,
                        'storage_group_name':storage_group_name,
                        'ec_profile_id':pool_dict['ecProfileId'],
                        'ec_ruleset_root':storage_group['name'],
                        'ec_failure_domain':pool_dict['ecFailureDomain'],
                        'created_by':created_by,
                        'tag':tag}
             
        else:
            #replicated pool 
            crush_ruleset = rule_id#self.conductor_api.get_ruleset_id(context, storage_group_id)
            if crush_ruleset < 0:
                msg = _('crush_ruleset must be a non-negative integer value')
                raise exc.HTTPBadRequest(explanation=msg)

            #size = pool_dict['replicationFactor']
            #replica_storage_group_id = pool_dict['replicatedStorageGroupId']
            #try:
            #     size = int(str(size))
            #     if size < 1:
            #         msg = _('size must be > 1')
            #         raise exc.HTTPBadRequest(explanation=msg)
            #
            #     host_num = self.conductor_api.count_hosts_by_storage_group_id(context, storage_group_id)
            #     LOG.info("storage_group_id:%s,host_num:%s", storage_group_id, host_num)
            #     if size > host_num:
            #         msg = "The replication factor must be less than or equal to the number of storage nodes in the specific storage group in cluster!"
            #         return {'message': msg}
            # except ValueError:
            #     msg = _('size must be an interger value')
            #     raise exc.HTTPBadRequest(explanation=msg)

            #pg_num = self._compute_pg_num(context, osd_num, size)

            #vsm_id = str(uuid.uuid1()).split('-')[0]
            pg_num = 64
            auto_growth_pg = pool_dict.get("auto_growth_pg",0)
            if  auto_growth_pg and int(auto_growth_pg) < pg_num and int(auto_growth_pg) > 0:
                pg_num = int(auto_growth_pg)
            #self._compute_pg_num(context, osd_num, size)
            body_info = {'name': name, #+ "-vsm" + vsm_id,
                        'cluster_id':cluster_id,
                        'storage_group_id':storage_group_id,
                        'storage_group_name':storage_group_name,
                        'pool_type':'replicated',
                        'crush_ruleset':crush_ruleset,
                        'pg_num':pg_num,
                        'pgp_num':pg_num,
                        'size':size,
                        'min_size':size,
                        'created_by':created_by,
                        'tag':tag}

        body_info.update({
            "quota": pool_dict.get("poolQuota"),
            "enable_quota": pool_dict.get("enablePoolQuota"),
            "max_pg_num_per_osd": pool_dict.get("max_pg_num_per_osd") or 100,
            "auto_growth_pg": pool_dict.get("auto_growth_pg") or 0,
        })
        #LOG.info('body_info=====%s'%body_info)
        return self.scheduler_api.create_storage_pool(context, body_info)

Based on this function, replicated pools are always created with at most 64 PGs initially (64, or fewer if the requested count is below 64), and are expected to be rebalanced to the user-selected (or auto-calculated) PG count later via the update_pg_and_pgp function.
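
In effect the replicated-pool branch above boils down to this (a distilled sketch of the existing logic, not new code):

    # Distilled sketch of the replicated-pool creation branch above.
    def initial_pg_num(pool_dict):
        pg_num = 64
        auto_growth_pg = int(pool_dict.get("auto_growth_pg") or 0)
        if 0 < auto_growth_pg < pg_num:
            pg_num = auto_growth_pg   # only honored when the target is below 64
        return pg_num

    print(initial_pg_num({"auto_growth_pg": "512"}))  # -> 64, despite asking for 512
    print(initial_pg_num({"auto_growth_pg": "32"}))   # -> 32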

I would just like some clarification on this behavior: what is the reasoning behind having VSM automatically rebalance PG counts on all storage pools?  Why are new pools, even with an explicitly designated PG count, created with only 64 PGs and then later updated to the 'correct' PG count?  Also, given the logic in the update_pg_and_pgp routine, it is clear that all pools, even those created without the 'auto-tune PG' flag selected in the pool creation dialog in the GUI, will have their PG counts rebalanced.  Is this an error, or does this routine need to be enhanced so that it only auto-tunes pools for which this behavior is desired?

Re: Storage pools rebalancing PG counts

rdowell
I'm going to elaborate on this to make my point clearer, because I am confused about the original purpose of the functionality, and the implementation doesn't seem to match the intention:

This is the replicated pool creation dialog in the GUI:

[screenshot of the Create Replicated Pool dialog]

When viewing the dialog, my reading of the wording "Enable PG Auto-tune" was that checking the box would allow VSM to automatically retune the pool's PG count to the optimal level if the cluster configuration changed in the future.

The GUI code, however, indicates that the checkbox actually controls only whether the PG value in the dialog box is read-only:

source/vsm-dashboard/vsm_dashboard/dashboards/vsm/poolsmanagement/templates/poolsmanagement/create_replicated_pool.html

                    <div class="control-group form-field clearfix ">
                        <label>Enable PG Auto-tune:</label>
                        <input id="chkPGNumber" type="checkbox" checked="checked" onchange="ChangeEableAutoQuto(this.checked)" />
                        <span class="help-block"></span>
                        <div class="input">
                          	<input id="txtPGNumber" maxlength="255"  type="number" class="form-control"  readOnly>
                        </div>
                    </div>

source/vsm-dashboard/static/dashboard/js/addpool.js

function ChangeEableAutoQuto(checked){
	$("#txtPGNumber")[0].readOnly = checked
}

I should also point out that the checkbox defaults to checked, but if it is unchecked, the PG number is edited, and the box is then checked again, the value does not revert to the auto-calculated number; the user's edited PG count remains in the dialog box.

When the pool creation request is submitted, the value in the dialog box is sent to the API, and the state of the checkbox is not used:

source/vsm-dashboard/static/dashboard/js/addpool.js

function AddPool(){
    //Check the field is should not null
    if($("#txtPoolName").val() == ""){
        showTip("error","The field is marked as '*' should not be empty");
        return  false;
    }
    var data = {
            "pool":{
                "name":$("#txtPoolName").val(),
                "storageGroupId":$("#selStorageGroup").val(),
                'storageGroupName': $("#selStorageGroup")[0].options[$("#selStorageGroup")[0].selectedIndex].text,
                "replicatedStorageGroupId":'replica',
                'auto_growth_pg': $("#txtPGNumber").val(),
                'tag': $("#txtTag").val(),
                'clusterId': "0",
                'createdBy': "VSM",
                'enablePoolQuota': $("#chkPoolQuota")[0].checked,
                'poolQuota': $("#txtPoolQuota").val(),
            }
    }

As this code indicates, the value of 'auto_growth_pg' is always set from the 'txtPGNumber' field, regardless of whether the user left the "Enable PG Auto-tune" checkbox checked.

The end result of all this is that 'auto_growth_pg' will ALWAYS be nonzero unless the user explicitly unchecks the "Enable PG Auto-tune" checkbox and deliberately enters 0.

In the API, the 'auto_growth_pg' field is read as a 'target' value for the pool's PG number, and the pool is always created with 64 PGs regardless of the user's input, unless the target value is less than 64:

source/vsm/vsm/api/v1/storage_pool.py::StoragePoolController:create()

        else:
            #replicated pool 
            crush_ruleset = rule_id#self.conductor_api.get_ruleset_id(context, storage_group_id)
            if crush_ruleset < 0:
                msg = _('crush_ruleset must be a non-negative integer value')
                raise exc.HTTPBadRequest(explanation=msg)

            #size = pool_dict['replicationFactor']
            #replica_storage_group_id = pool_dict['replicatedStorageGroupId']
            #try:
            #     size = int(str(size))
            #     if size < 1:
            #         msg = _('size must be > 1')
            #         raise exc.HTTPBadRequest(explanation=msg)
            #
            #     host_num = self.conductor_api.count_hosts_by_storage_group_id(context, storage_group_id)
            #     LOG.info("storage_group_id:%s,host_num:%s", storage_group_id, host_num)
            #     if size > host_num:
            #         msg = "The replication factor must be less than or equal to the number of storage nodes in the specific storage group in cluster!"
            #         return {'message': msg}
            # except ValueError:
            #     msg = _('size must be an interger value')
            #     raise exc.HTTPBadRequest(explanation=msg)

            #pg_num = self._compute_pg_num(context, osd_num, size)

            #vsm_id = str(uuid.uuid1()).split('-')[0]
            pg_num = 64
            auto_growth_pg = pool_dict.get("auto_growth_pg",0)
            if  auto_growth_pg and int(auto_growth_pg) < pg_num and int(auto_growth_pg) > 0:
                pg_num = int(auto_growth_pg)
            #self._compute_pg_num(context, osd_num, size)
            body_info = {'name': name, #+ "-vsm" + vsm_id,
                        'cluster_id':cluster_id,
                        'storage_group_id':storage_group_id,
                        'storage_group_name':storage_group_name,
                        'pool_type':'replicated',
                        'crush_ruleset':crush_ruleset,
                        'pg_num':pg_num,
                        'pgp_num':pg_num,
                        'size':size,
                        'min_size':size,
                        'created_by':created_by,
                        'tag':tag}

        body_info.update({
            "quota": pool_dict.get("poolQuota"),
            "enable_quota": pool_dict.get("enablePoolQuota"),
            "max_pg_num_per_osd": pool_dict.get("max_pg_num_per_osd") or 100,
            "auto_growth_pg": pool_dict.get("auto_growth_pg") or 0,
        })
        #LOG.info('body_info=====%s'%body_info)
        return self.scheduler_api.create_storage_pool(context, body_info)

In this code, auto_growth_pg will always be nonzero for a replicated pool (unless the user explicitly sets it to 0).  Incidentally, it will also always be 0 for an EC pool, since no PG number is specified when creating an EC pool.

In the agent, a nonzero auto_growth_pg actually means that the pool will NOT be rebalanced, except for the first time, when a replicated pool is updated from the initial 64 PGs to the target value.  A zero auto_growth_pg actually means that the pool WILL be rebalanced, for both replicated and EC pools, every time the cluster configuration changes:

source/vsm/vsm/agent/manager.py::AgentManager:update_pg_and_pgp()

            if auto_growth_pg:
                max_pg_num_finally = auto_growth_pg
            else:
                size = pool['size']
                if pool['size'] == 0:
                    pool_default_size = db.vsm_settings_get_by_name(context,'osd_pool_default_size')
                    size = int(pool_default_size.value)
                max_pg_num_finally = max_pg_num_per_osd * osd_num_per_group//size
            if max_pg_num_finally > pool['pg_num']:
                pg_num = max_pg_num_finally#self._compute_pg_num(context, osd_num_per_group, pool['size'])
                LOG.info("pool['crush_ruleset'] id %s has %s osds" % (pool['crush_ruleset'], osd_num_per_group))
                if osd_num_per_group > 64:
                    osd_num_per_group = 64
                    LOG.info("osd_num_per_group > 64, use osd_num_per_group=64")
                step_max_pg_num = osd_num_per_group * 32
                max_pg_num = step_max_pg_num + pool['pg_num']
                if pg_num > max_pg_num_finally:
                    pgp_num = pg_num = max_pg_num_finally
                    self.set_pool_pg_pgp_num(context, pool['name'], pg_num, pgp_num)
                elif pg_num > max_pg_num:
                    pgp_num = pg_num = max_pg_num
                    self.set_pool_pg_pgp_num(context, pool['name'], pg_num, pgp_num)
                elif pg_num > pool['pg_num']:
                    pgp_num = pg_num
                    self.set_pool_pg_pgp_num(context, pool['name'], pg_num, pgp_num)
                else:
                    continue

The end result here is that the "Enable PG Auto-tune" checkbox when creating a replicated pool has only one effect: letting the user choose their own PG count or accept the default.  Regardless of which they choose, that initial PG count is used, and the pool is never actually "auto-tuned" after the first update from 64 to the target value.  An EC pool, on the other hand, will always be rebalanced by the agent logic, because it will always have auto_growth_pg set to 0.
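
A quick sketch of the two effective target calculations described above (illustrative only, with hypothetical numbers):

    # Mirrors the agent's choice of target PG count for a pool (sketch only).
    def target_pg_num(auto_growth_pg, max_pg_num_per_osd, osd_count, size):
        if auto_growth_pg:                               # GUI-created replicated pools
            return auto_growth_pg                        # fixed forever after creation
        return max_pg_num_per_osd * osd_count // size    # EC and discovered pools

    print(target_pg_num(512, 100, 30, 3))   # replicated pool: always 512
    print(target_pg_num(0,   100, 30, 3))   # discovered/EC pool: 1000 today...
    print(target_pg_num(0,   100, 60, 3))   # ...and 2000 once the cluster doubles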

Any pools NOT created by VSM (created by the user with the Ceph CLI) will also have auto_growth_pg set to 0:

source/vsm/vsm/agent/manager.py::AgentManager:update_pool_status()

        # Pools in ceph, not int db.
        # Add it.
        add_pools = set(ceph_names) - set(db_names)
        for pool_name in add_pools:
            # if is in ceph, not in db.
            # and not created by vsm.
            for pool in ceph_list:
                if pool['pool_name'] == pool_name:
                    storage_group = db.storage_group_get_by_rule_id(context, \
                                                    pool.get('crush_ruleset'))
                    values = {
                        'pool_id': pool.get('pool'),
                        'name': pool.get('pool_name'),
                        'pg_num': pool.get('pg_num'),
                        'pgp_num': pool.get('pg_placement_num'),
                        'size': pool.get('size'),
                        'min_size': pool.get('min_size'),
                        'crush_ruleset': pool.get('crush_ruleset'),
                        'crash_replay_interval': pool.get('crash_replay_interval')
                    }
                    values['created_by'] = 'ceph'
                    values['cluster_id'] = cluster_id
                    values['tag'] = 'SYSTEM'
                    values['status'] = 'running'
                    if storage_group:
                        values['primary_storage_group_id'] = storage_group['id']
                    LOG.debug('Pool %s in ceph, not in db, add it.' % values)
                    self._conductor_rpcapi.create_storage_pool(context, values)

When VSM discovers a pool created externally, it doesn't set auto_growth_pg, which defaults to 0 in the database, so that pool will be rebalanced without any user input.

The final question:

What is the actual intention of the "Enable PG Auto-tune" checkbox when creating a replicated pool?

Is it intended to indicate a feature that automatically rebalances storage pools when OSDs are added to or removed from the cluster?  If so, the implementation is flawed: not only is the value always nonzero for pools created by the GUI, so an automatic rebalance will never occur (aside from the initial rebalance after creation, since pools are always created with 64 PGs), but a zero value also does not prevent rebalances on pools that were NOT created by the GUI, so there is in fact no way to keep discovered pools from being rebalanced.

Or is the checkbox simply meant to indicate choosing the automatically calculated PG value in the GUI instead of designating one yourself?  If so, the GUI is unclear, as it implies a feature that does not exist; moreover, if the value is edited and the "auto-tune" option is then reselected, the auto-calculated value is neither restored nor used.

Re: Storage pools rebalancing PG counts

ywang19
Administrator
In reply to this post by rdowell
Thanks for the lengthy and detailed analysis of the issue.

To answer your question, the "auto-tune PG" option controls whether VSM is allowed to change the PG count automatically.  VSM calculates a recommended value based on the storage group the user chooses; if that is not the user's preference, the user can disable auto-tune and assign a PG count themselves.  The auto-tune flag triggers rebalancing at pool creation time, and later, when the cluster topology changes, rebalancing recurs for all pools.  That is the general idea of this feature.

-yaguang

Re: Storage pools rebalancing PG counts

rdowell
Thanks Yaguang,

Based on your response, it sounds like auto-tune is indeed intended to be something that happens over time as the cluster configuration changes, which was our initial assumption.  If that's the case, then looking at the code, I don't think the implementation accomplishes it correctly.

From the code analysis I've done today, auto_growth_pg is always set to a nonzero value when creating a replicated pool, and this in turn causes the pool NOT to automatically adjust to cluster configuration changes.  Whatever value is in the UI when the pool is created, whether the automatically calculated recommendation or a user-modified number, is stored as 'auto_growth_pg' and used as the pool's 'target' PG value forever.  The pool is only rebalanced once: since pools are always created with 64 PGs, a single rebalance updates the pool from 64 to whatever value was designated in the UI, but after that no further rebalancing occurs, regardless of whether the auto-tune checkbox was selected.

This code segment is the culprit: the code that rebalances PG counts based on the cluster configuration is in the 'else' block, which is executed only when auto_growth_pg is 0.  That is only the case for discovered pools (created externally with the Ceph CLI) and EC pools, and never for replicated pools created in VSM.  In essence, this segment of the logic is backwards with respect to the stated intent of the feature.

source/vsm/vsm/agent/manager.py::AgentManager:update_pg_and_pgp()

            if auto_growth_pg:
                max_pg_num_finally = auto_growth_pg
            else:
                size = pool['size']
                if pool['size'] == 0:
                    pool_default_size = db.vsm_settings_get_by_name(context,'osd_pool_default_size')
                    size = int(pool_default_size.value)
                max_pg_num_finally = max_pg_num_per_osd * osd_num_per_group//size

I think there is a simple way to fix this so that the implementation matches the intended functionality.  I can create a pull request to outline what I believe needs to be fixed.
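
To illustrate the kind of inversion I have in mind, here is a minimal sketch (not the actual change; it assumes a new boolean auto_tune_pg persisted from the GUI checkbox, which does not exist today):

    # Hypothetical sketch only: choose the target PG count for a pool,
    # assuming a new `auto_tune_pg` flag recorded at creation time.
    def compute_target_pg_num(pool, max_pg_num_per_osd, osd_num_per_group,
                              default_size):
        if pool.get('auto_tune_pg'):
            # auto-tune enabled: recalculate from the current cluster topology
            size = pool['size'] or default_size
            return max_pg_num_per_osd * osd_num_per_group // size
        if pool.get('auto_growth_pg'):
            # auto-tune disabled: grow once toward the user-chosen target, then stop
            return pool['auto_growth_pg']
        # pools discovered from the Ceph CLI: leave their PG counts alone
        return pool['pg_num']

update_pg_and_pgp() would then only apply this target when it exceeds the pool's current pg_num, as it already does today.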