...
 
Commits (14)
  • btrfs: fix log context list corruption after rename whiteout error · 236ebc20
    Filipe Manana authored
    During a rename whiteout, if btrfs_whiteout_for_rename() returns an error,
    we can return from btrfs_rename() with the log context object still in the
    root's log context list. This happens if 'sync_log' was set to true before
    we called btrfs_whiteout_for_rename(), and it is dangerous because we end
    up with a corrupt linked list (root->log_ctxs), as the log context object
    was allocated on the stack.
    
    After btrfs_rename() returns, any task that is running btrfs_sync_log()
    concurrently can end up crashing because that linked list is traversed by
    btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in
    the same issue that commit e6c61710 ("Btrfs: fix log context list
    corruption after rename exchange operation") fixed.
    
    Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    236ebc20
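
    The shape of the fix can be seen in the btrfs_rename() hunk near the end
    of this page: on the error path the on-stack ctx is unlinked from
    root->log_ctxs before returning. A condensed sketch of that branch, with
    names as in the hunk below:

      } else if (sync_log) {
              /*
               * The on-stack ctx was queued on root->log_ctxs but we bail
               * out before btrfs_sync_log() runs, so unlink it to avoid
               * leaving a dangling stack pointer in the list.
               */
              mutex_lock(&root->log_mutex);
              list_del(&ctx.list);
              mutex_unlock(&root->log_mutex);
      }
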
  • btrfs: fix removal of raid[56|1c34} incompat flags after removing block group · d8e6fd5c
    Filipe Manana authored
    We are incorrectly dropping the raid56 and raid1c34 incompat flags when
    there are still raid56 and raid1c34 block groups, not when we no longer
    have any of those. The logic got unintentionally broken when support for
    the raid1c34 modes was added.
    
    Fix this by clearing the flags only if we do not have block groups with
    the respective profiles.
    
    Fixes: 9c907446 ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    d8e6fd5c
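
    The fix is a simple inversion of the two conditions, as the
    clear_incompat_bg_bits() hunk later on this page shows; roughly:

      /* Only drop an incompat bit once no block group of that profile remains. */
      if (!found_raid56)
              btrfs_clear_fs_incompat(fs_info, RAID56);
      if (!found_raid1c34)
              btrfs_clear_fs_incompat(fs_info, RAID1C34);
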
  • memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event · 7d36665a
    Chunguang Xu authored
    When an eventfd that monitors multiple memory thresholds of a cgroup is
    closed, the kernel deletes all events related to that eventfd.  If, before
    all of those events are deleted, another eventfd starts monitoring a
    memory threshold of the same cgroup, the kernel can crash:
    
      BUG: kernel NULL pointer dereference, address: 0000000000000004
      #PF: supervisor write access in kernel mode
      #PF: error_code(0x0002) - not-present page
      PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0
      Oops: 0002 [#1] SMP PTI
      CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3
      Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014
      Workqueue: events memcg_event_remove
      RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190
      RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001
      RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001
      RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010
      R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880
      R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0
      Call Trace:
        memcg_event_remove+0x32/0x90
        process_one_work+0x172/0x380
        worker_thread+0x49/0x3f0
        kthread+0xf8/0x130
        ret_from_fork+0x35/0x40
      CR2: 0000000000000004
    
    We can reproduce this problem as follows:
    
    1. We create a new cgroup subdirectory and a new eventfd, and then we
       monitor multiple memory thresholds of the cgroup through this eventfd.
    
    2. We close this eventfd, and __mem_cgroup_usage_unregister_event()
       is called multiple times to delete all events related to this
       eventfd.
    
    The first time __mem_cgroup_usage_unregister_event() is called, the
    kernel clears all items related to this eventfd in thresholds->primary.
    
    Since there is currently only one eventfd, thresholds->primary becomes
    empty, so the kernel sets thresholds->primary and thresholds->spare to
    NULL.  If at this point the user creates a new eventfd and monitors a
    memory threshold of this cgroup, the kernel re-initializes
    thresholds->primary.
    
    Then, when __mem_cgroup_usage_unregister_event() is called for the
    second time, because thresholds->primary is not empty, the kernel
    accesses thresholds->spare, but thresholds->spare is NULL, which
    triggers the crash.
    
    In general, the longer it takes to delete all events related to the
    eventfd, the easier it is to trigger this problem.
    
    The solution is to check whether the thresholds associated with the
    eventfd have already been cleared when deleting the event.  If so, we
    do nothing.
    
    [akpm@linux-foundation.org: fix comment, per Kirill]
    Fixes: 907860ed ("cgroups: make cftype.unregister_event() void-returning")
    Signed-off-by: Chunguang Xu <brookxu@tencent.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    7d36665a
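
    The fix, visible in the __mem_cgroup_usage_unregister_event() hunk near
    the end of this page, counts how many entries in thresholds->primary
    belong to the eventfd being removed and bails out early when there are
    none; condensed:

      size = entries = 0;
      for (i = 0; i < thresholds->primary->size; i++) {
              if (thresholds->primary->entries[i].eventfd != eventfd)
                      size++;         /* thresholds that survive the removal */
              else
                      entries++;      /* thresholds owned by this eventfd */
      }

      new = thresholds->spare;

      /* If no items related to eventfd have been cleared, nothing to do */
      if (!entries)
              goto unlock;
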
  • mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case · d41e2f3b
    Baoquan He authored
    In section_deactivate(), pfn_to_page() no longer works once
    ms->section_mem_map has been reset to NULL in the SPARSEMEM|!VMEMMAP case.
    It causes a hot remove failure:
    
      kernel BUG at mm/page_alloc.c:4806!
      invalid opcode: 0000 [#1] SMP PTI
      CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G        W         5.5.0-next-20200205+ #340
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
      Workqueue: kacpi_hotplug acpi_hotplug_work_fn
      RIP: 0010:free_pages+0x85/0xa0
      Call Trace:
       __remove_pages+0x99/0xc0
       arch_remove_memory+0x23/0x4d
       try_remove_memory+0xc8/0x130
       __remove_memory+0xa/0x11
       acpi_memory_device_remove+0x72/0x100
       acpi_bus_trim+0x55/0x90
       acpi_device_hotplug+0x2eb/0x3d0
       acpi_hotplug_work_fn+0x1a/0x30
       process_one_work+0x1a7/0x370
       worker_thread+0x30/0x380
       kthread+0x112/0x130
       ret_from_fork+0x35/0x40
    
    Let's move the ->section_mem_map reset to after
    depopulate_section_memmap() to fix it.
    
    [akpm@linux-foundation.org: remove unneeded initialization, per David]
    Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Wei Yang <richardw.yang@linux.intel.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    d41e2f3b
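
    The mm/sparse.c hunk below captures the idea: remember whether the
    subsection map became empty, free the memmap first (depopulation still
    needs pfn_to_page(), hence a valid ->section_mem_map, in the !VMEMMAP
    case), and only then clear the section; condensed:

      empty = bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION);

      /* ... free ms->usage and decode the memmap as before ... */

      if (section_is_early && memmap)
              free_map_bootmem(memmap);
      else
              depopulate_section_memmap(pfn, nr_pages, altmap);

      if (empty)
              ms->section_mem_map = (unsigned long)NULL;
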
  • page-flags: fix a crash at SetPageError(THP_SWAP) · d72520ad
    Qian Cai authored
    Commit bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped
    out") supported writing THP to a swap device but forgot to update the
    older commit df8c94d1 ("page-flags: define behavior of FS/IO-related
    flags on compound pages"), so THP swap-out could trigger a crash with
    DEBUG_VM_PGFLAGS=y:
    
      kernel BUG at include/linux/page-flags.h:317!
    
      page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
      page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0
      anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked)
    
      end_swap_bio_write()
        SetPageError(page)
          VM_BUG_ON_PAGE(1 && PageCompound(page))
    
      <IRQ>
      bio_endio+0x297/0x560
      dec_pending+0x218/0x430 [dm_mod]
      clone_endio+0xe4/0x2c0 [dm_mod]
      bio_endio+0x297/0x560
      blk_update_request+0x201/0x920
      scsi_end_request+0x6b/0x4b0
      scsi_io_completion+0x509/0x7e0
      scsi_finish_command+0x1ed/0x2a0
      scsi_softirq_done+0x1c9/0x1d0
      __blk_mqnterrupt+0xf/0x20
      </IRQ>
    
    Fix by checking PF_NO_TAIL in those places instead.
    
    Fixes: bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped out")
    Signed-off-by: Qian Cai <cai@lca.pw>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Acked-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Rafael Aquini <aquini@redhat.com>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/20200310235846.1319-1-cai@lca.pw
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    d72520ad
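
    The fix is the one-line policy change in include/linux/page-flags.h shown
    below: the Error flag moves from PF_NO_COMPOUND, which forbids the flag on
    any compound page, to PF_NO_TAIL, which accepts head pages and only
    rejects modifications on tail pages:

      /* before: any compound page trips the VM_BUG_ON under DEBUG_VM_PGFLAGS */
      PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)

      /* after: SetPageError() on a THP head page is fine */
      PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL)
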
  • mm, memcg: fix corruption on 64-bit divisor in memory.high throttling · d397a45f
    Chris Down authored
    Commit 0e4b01df had a bunch of fixups to use the right division
    method.  However, it seems that after all that it still wasn't right --
    div_u64 takes a 32-bit divisor.
    
    The headroom is still large (2^32 pages), so on mundane systems you
    won't hit this, but this should definitely be fixed.
    
    Fixes: 0e4b01df ("mm, memcg: throttle allocators when failing reclaim over memory.high")
    Reported-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Chris Down <chris@chrisdown.name>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Nathan Chancellor <natechancellor@gmail.com>
    Cc: <stable@vger.kernel.org>	[5.4.x+]
    Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    d397a45f
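
    The corrected calculation shows up in the mm/memcontrol.c hunk further
    down: the overage is divided with div64_u64(), which takes a full 64-bit
    divisor, instead of div_u64(), which truncates the divisor to 32 bits;
    roughly:

      /* memory.high can exceed 2^32 pages, so the divisor must stay 64-bit */
      overage = usage - high;
      overage <<= MEMCG_DELAY_PRECISION_SHIFT;
      overage = div64_u64(overage, high);     /* was div_u64(), 32-bit divisor */
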
  • mm, memcg: throttle allocators based on ancestral memory.high · e26733e0
    Chris Down authored
    Prior to this commit, we only directly check the affected cgroup's
    memory.high against its usage.  However, it's possible that we are being
    reclaimed as a result of hitting an ancestor memory.high and should be
    penalised based on that, instead.
    
    This patch changes memory.high overage throttling to use the largest
    overage in its ancestors when considering how many penalty jiffies to
    charge.  This makes sure that we penalise poorly behaving cgroups in the
    same way regardless of at what level of the hierarchy memory.high was
    breached.
    
    Fixes: 0e4b01df ("mm, memcg: throttle allocators when failing reclaim over memory.high")
    Reported-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Chris Down <chris@chrisdown.name>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Nathan Chancellor <natechancellor@gmail.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: <stable@vger.kernel.org>	[5.4.x+]
    Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    e26733e0
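
    The reworked calculate_high_delay() in the mm/memcontrol.c hunk below
    walks from the charging memcg up towards the root and keeps the largest
    overage it finds; a condensed sketch of that loop:

      u64 max_overage = 0;

      do {
              unsigned long usage, high;
              u64 overage;

              usage = page_counter_read(&memcg->memory);
              /* act as if memory.high were at least one page to avoid div by 0 */
              high = max(READ_ONCE(memcg->high), 1UL);

              overage = (u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT;
              overage = div64_u64(overage, high);
              if (overage > max_overage)
                      max_overage = overage;
      } while ((memcg = parent_mem_cgroup(memcg)) &&
               !mem_cgroup_is_root(memcg));
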
  • mm: do not allow MADV_PAGEOUT for CoW pages · 12e967fd
    Michal Hocko authored
    Jann has brought up a very interesting point [1].  While shared pages
    are excluded from MADV_PAGEOUT normally, CoW pages can be easily
    reclaimed that way.  This can lead to all sorts of hard to debug
    problems.  E.g.  performance problems outlined by Daniel [2].
    
    There are runtime environments where a substantial amount of memory is
    shared among security domains via CoW memory, and an easy way to reclaim
    that memory, which MADV_{COLD,PAGEOUT} offers, can lead either to
    performance degradation for the parent process, which might be more
    privileged, or even open side-channel attacks.
    
    The feasibility of the latter is not really clear to me, TBH, but there
    is no real reason for the exposure at this stage.  There seems to be no
    real use case that depends on reclaiming CoW memory via madvise at this
    stage, so it is much easier to simply disallow it, and that is what this
    patch does.  Put simply, MADV_{PAGEOUT,COLD} can operate only on
    exclusively owned memory, which is a straightforward semantic.
    
    [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
    [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
    
    Fixes: 9c276cc6 ("mm: introduce MADV_COLD")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Daniel Colascione <dancol@google.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    12e967fd
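
    In the mm/madvise.c hunks below the rule is enforced with a mapcount
    check in both the huge-PMD and the regular PTE paths of
    madvise_cold_or_pageout_pte_range(); the essence:

      /* Do not interfere with other mappings of this page */
      if (page_mapcount(page) != 1)
              continue;       /* the PMD path uses goto huge_unlock instead */
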
  • epoll: fix possible lost wakeup on epoll_ctl() path · 1b53734b
    Roman Penyaev authored
    This fixes a possible lost wakeup introduced by commit a218cc49.
    Originally, modifications to ep->wq were serialized by ep->wq.lock, but
    in commit a218cc49 ("epoll: use rwlock in order to reduce
    ep_poll_callback() contention") a new rwlock was introduced in order to
    relax the fd event path, i.e. callers of the ep_poll_callback() function.
    
    After the change, ep_modify() and ep_insert() (both called on the
    epoll_ctl() path) were switched to ep->lock, but ep_poll() (epoll_wait)
    was still using ep->wq.lock for wqueue list modification.
    
    The bug doesn't lead to any wqueue list corruption, because the wake-up
    path and the list modifications were still serialized by ep->wq.lock
    internally, but the waitqueue_active() check prior to the wake_up() call
    can be reordered with modifications of the ep ready list, so a wakeup
    can be lost.
    
    And yes, it could be healed by an explicit smp_mb():
    
      list_add_tail(&epi->rdllink, &ep->rdllist);
      smp_mb();
      if (waitqueue_active(&ep->wq))
    	wake_up(&ep->wq);
    
    But let's keep it simple: this patch replaces ep->wq.lock with ep->lock
    for wqueue modifications, so the wake-up path always observes the
    activeness of the wqueue correctly.
    
    Fixes: a218cc49 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
    Reported-by: Max Neunhoeffer <max@arangodb.com>
    Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Tested-by: Max Neunhoeffer <max@arangodb.com>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io>
    Cc: Davidlohr Bueso <dbueso@suse.de>
    Cc: Jason Baron <jbaron@akamai.com>
    Cc: Jes Sorensen <jes.sorensen@gmail.com>
    Cc: <stable@vger.kernel.org>	[5.1+]
    Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
    References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
    Bisected-by: Max Neunhoeffer <max@arangodb.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    1b53734b
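
    The fs/eventpoll.c hunk below shows the change in ep_poll(): the waiter
    is added to and removed from ep->wq under ep->lock, the rwlock also taken
    by ep_poll_callback(), instead of ep->wq.lock; roughly:

      init_waitqueue_entry(&wait, current);

      write_lock_irq(&ep->lock);            /* was spin_lock_irq(&ep->wq.lock) */
      __add_wait_queue_exclusive(&ep->wq, &wait);
      write_unlock_irq(&ep->lock);          /* was spin_unlock_irq(&ep->wq.lock) */
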
  • mm/mmu_notifier: silence PROVE_RCU_LIST warnings · 63886bad
    Qian Cai authored
    It is safe to traverse mm->notifier_subscriptions->list either under
    SRCU read lock or mm->notifier_subscriptions->lock using
    hlist_for_each_entry_rcu().  Silence the PROVE_RCU_LIST false positives,
    for example,
    
      WARNING: suspicious RCU usage
      -----------------------------
      mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!
    
      other info that might help us debug this:
    
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by libvirtd/802:
       #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
       #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
       #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460
    
      stack backtrace:
      CPU: 7 PID: 802 Comm: libvirtd Tainted: G          I       5.6.0-rc6-next-20200317+ #2
      Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
      Call Trace:
        dump_stack+0xa4/0xfe
        lockdep_rcu_suspicious+0xeb/0xf5
        __mmu_notifier_invalidate_range_start+0x3ff/0x460
        change_p4d_range+0x746/0x800
        change_protection+0x1df/0x300
        mprotect_fixup+0x245/0x3e0
        do_mprotect_pkey+0x23b/0x3e0
        __x64_sys_mprotect+0x51/0x70
        do_syscall_64+0x91/0xae8
        entry_SYSCALL_64_after_hwframe+0x49/0xb3
    Signed-off-by: Qian Cai <cai@lca.pw>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
    Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    63886bad
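
    As the mm/mmu_notifier.c hunks below show, the fix passes an explicit
    lockdep condition as the optional fourth argument of
    hlist_for_each_entry_rcu(), so the traversal is recognized as safe under
    either SRCU or the subscriptions spinlock; roughly:

      /* reader side, protected by SRCU */
      id = srcu_read_lock(&srcu);
      hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist,
                               srcu_read_lock_held(&srcu)) {
              /* invoke the notifier callback */
      }
      srcu_read_unlock(&srcu, id);

      /* writer side, protected by the subscriptions lock */
      spin_lock(&mm->notifier_subscriptions->lock);
      hlist_for_each_entry_rcu(subscription,
                               &mm->notifier_subscriptions->list, hlist,
                               lockdep_is_held(&mm->notifier_subscriptions->lock)) {
              /* compare subscription->ops */
      }
      spin_unlock(&mm->notifier_subscriptions->lock);
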
  • mm, slub: prevent kmalloc_node crashes and memory leaks · 0715e6c5
    Vlastimil Babka authored
    Sachin reports [1] a crash in SLUB __slab_alloc():
    
      BUG: Kernel NULL pointer dereference on read at 0x000073b0
      Faulting instruction address: 0xc0000000003d55f4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
      Modules linked in:
      CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
      NIP:  c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
      REGS: c0000008b37836d0 TRAP: 0300   Not tainted  (5.6.0-rc2-next-20200218-autotest)
      MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004844  XER: 00000000
      CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
      GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
      GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
      GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
      GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
      GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
      GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
      GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
      GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
      NIP ___slab_alloc+0x1f4/0x760
      LR __slab_alloc+0x34/0x60
      Call Trace:
        ___slab_alloc+0x334/0x760 (unreliable)
        __slab_alloc+0x34/0x60
        __kmalloc_node+0x110/0x490
        kvmalloc_node+0x58/0x110
        mem_cgroup_css_online+0x108/0x270
        online_css+0x48/0xd0
        cgroup_apply_control_enable+0x2ec/0x4d0
        cgroup_mkdir+0x228/0x5f0
        kernfs_iop_mkdir+0x90/0xf0
        vfs_mkdir+0x110/0x230
        do_mkdirat+0xb0/0x1a0
        system_call+0x5c/0x68
    
    This is a PowerPC platform with following NUMA topology:
    
      available: 2 nodes (0-1)
      node 0 cpus:
      node 0 size: 0 MB
      node 0 free: 0 MB
      node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
      node 1 size: 35247 MB
      node 1 free: 30907 MB
      node distances:
      node   0   1
        0:  10  40
        1:  40  10
    
      possible numa nodes: 0-31
    
    This only happens with a mmotm patch "mm/memcontrol.c: allocate
    shrinker_map on appropriate NUMA node" [2] which effectively calls
    kmalloc_node for each possible node.  SLUB however only allocates
    kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
    node_to_mem_node to return such valid node for other nodes since commit
    a561ce00 ("slub: fall back to node_to_mem_node() node if allocating
    on memoryless node").  This is however not true in this configuration
    where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
    thus it contains zeroes and get_partial() ends up accessing
    non-allocated kmem_cache_node.
    
    A related issue was reported by Bharata (originally by Ramachandran) [3]
    where a similar PowerPC configuration, but with a mainline kernel without
    patch [2], ends up allocating large amounts of pages via the kmalloc-1k
    and kmalloc-512 caches.  This seems to have the same underlying issue
    with node_to_mem_node() not behaving as expected, and can probably also
    lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].
    
    This patch should fix both issues by not relying on node_to_mem_node()
    anymore and instead simply falling back to NUMA_NO_NODE, when
    kmalloc_node(node) is attempted for a node that's not online, or has no
    usable memory.  The "usable memory" condition is also changed from
    node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
    the condition that SLUB uses to allocate kmem_cache_node structures.
    The check in get_partial() is removed completely, as the checks in
    ___slab_alloc() are now sufficient to prevent get_partial() being
    reached with an invalid node.
    
    [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
    [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
    [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
    [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/
    
    Fixes: a561ce00 ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
    Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
    Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
    Tested-by: Bharata B Rao <bharata@linux.ibm.com>
    Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Christopher Lameter <cl@linux.com>
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Nathan Lynch <nathanl@linux.ibm.com>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
    Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    0715e6c5
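
    The fallback itself is the small block added to ___slab_alloc() in the
    mm/slub.c hunk near the end of this page: if the requested node is not
    online or has no normal memory, the node constraint is dropped before a
    new slab is allocated; condensed:

      if (!page) {
              /*
               * if the node is not online or has no normal memory, just
               * ignore the node constraint
               */
              if (unlikely(node != NUMA_NO_NODE &&
                           !node_state(node, N_NORMAL_MEMORY)))
                      node = NUMA_NO_NODE;
              goto new_slab;
      }
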
  • x86/mm: split vmalloc_sync_all() · 763802b5
    Joerg Roedel authored
    Commit 3f8fd02b ("mm/vmalloc: Sync unmappings in
    __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
    the vunmap() code-path.  While this change was necessary to maintain
    correctness on x86-32-pae kernels, it also adds additional cycles for
    architectures that don't need it.
    
    Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
    severe performance regressions in micro-benchmarks because it now also
    calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
    the vmalloc_sync_all() implementation on x86-64 is only needed for newly
    created mappings.
    
    To avoid the unnecessary work on x86-64 and to gain the performance
    back, split up vmalloc_sync_all() into two functions:
    
    	* vmalloc_sync_mappings(), and
    	* vmalloc_sync_unmappings()
    
    Most call-sites to vmalloc_sync_all() only care about new mappings being
    synchronized.  The only exception is the new call-site added in the
    above mentioned commit.
    
    Shile Zhang directed us to a report of an 80% regression in reaim
    throughput.
    
    Fixes: 3f8fd02b ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
    Signed-off-by: Joerg Roedel <jroedel@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Tested-by: Borislav Petkov <bp@suse.de>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: <stable@vger.kernel.org>
    Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
    Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
    Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    763802b5
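
    On x86-64, where the regression was seen, the split makes the vunmap()
    path a no-op, as the hunk under "#else /* CONFIG_X86_64: */" in the diff
    below shows; condensed:

      /* x86-64: only new p4d/pud pages need propagating to all PGDs */
      void vmalloc_sync_mappings(void)
      {
              sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
      }

      /* x86-64: unmappings never free p4d/pud pages, nothing to do */
      void vmalloc_sync_unmappings(void)
      {
      }
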
  • Merge branch 'akpm' (patches from Andrew) · b3c03db6
    Linus Torvalds authored
    Merge misc fixes from Andrew Morton:
     "10 fixes"
    
    * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
      x86/mm: split vmalloc_sync_all()
      mm, slub: prevent kmalloc_node crashes and memory leaks
      mm/mmu_notifier: silence PROVE_RCU_LIST warnings
      epoll: fix possible lost wakeup on epoll_ctl() path
      mm: do not allow MADV_PAGEOUT for CoW pages
      mm, memcg: throttle allocators based on ancestral memory.high
      mm, memcg: fix corruption on 64-bit divisor in memory.high throttling
      page-flags: fix a crash at SetPageError(THP_SWAP)
      mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
      memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event
    b3c03db6
  • Merge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 67d584e3
    Linus Torvalds authored
    Pull btrfs fixes from David Sterba:
     "Two fixes.
    
      The first is a regression: when dropping some incompat bits the
      conditions were reversed. The other is a fix for rename whiteout
      potentially leaving stack memory linked to a list"
    
    * tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
      btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
      btrfs: fix log context list corruption after rename whiteout error
    67d584e3
......@@ -190,7 +190,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
return pmd_k;
}
void vmalloc_sync_all(void)
static void vmalloc_sync(void)
{
unsigned long address;
......@@ -217,6 +217,16 @@ void vmalloc_sync_all(void)
}
}
void vmalloc_sync_mappings(void)
{
vmalloc_sync();
}
void vmalloc_sync_unmappings(void)
{
vmalloc_sync();
}
/*
* 32-bit:
*
......@@ -319,11 +329,23 @@ static void dump_pagetable(unsigned long address)
#else /* CONFIG_X86_64: */
void vmalloc_sync_all(void)
void vmalloc_sync_mappings(void)
{
/*
* 64-bit mappings might allocate new p4d/pud pages
* that need to be propagated to all tasks' PGDs.
*/
sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
}
void vmalloc_sync_unmappings(void)
{
/*
* Unmappings never allocate or free p4d/pud pages.
* No work is required here.
*/
}
/*
* 64-bit:
*
......
......@@ -171,7 +171,7 @@ int ghes_estatus_pool_init(int num_ghes)
* New allocation must be visible in all pgd before it can be found by
* an NMI allocating from the pool.
*/
vmalloc_sync_all();
vmalloc_sync_mappings();
rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
if (rc)
......
......@@ -856,9 +856,9 @@ static void clear_incompat_bg_bits(struct btrfs_fs_info *fs_info, u64 flags)
found_raid1c34 = true;
up_read(&sinfo->groups_sem);
}
if (found_raid56)
if (!found_raid56)
btrfs_clear_fs_incompat(fs_info, RAID56);
if (found_raid1c34)
if (!found_raid1c34)
btrfs_clear_fs_incompat(fs_info, RAID1C34);
}
}
......
......@@ -9496,6 +9496,10 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
ret = btrfs_sync_log(trans, BTRFS_I(old_inode)->root, &ctx);
if (ret)
commit_transaction = true;
} else if (sync_log) {
mutex_lock(&root->log_mutex);
list_del(&ctx.list);
mutex_unlock(&root->log_mutex);
}
if (commit_transaction) {
ret = btrfs_commit_transaction(trans);
......
......@@ -1854,9 +1854,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
waiter = true;
init_waitqueue_entry(&wait, current);
spin_lock_irq(&ep->wq.lock);
write_lock_irq(&ep->lock);
__add_wait_queue_exclusive(&ep->wq, &wait);
spin_unlock_irq(&ep->wq.lock);
write_unlock_irq(&ep->lock);
}
for (;;) {
......@@ -1904,9 +1904,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
goto fetch_events;
if (waiter) {
spin_lock_irq(&ep->wq.lock);
write_lock_irq(&ep->lock);
__remove_wait_queue(&ep->wq, &wait);
spin_unlock_irq(&ep->wq.lock);
write_unlock_irq(&ep->lock);
}
return res;
......
......@@ -311,7 +311,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
__PAGEFLAG(Locked, locked, PF_NO_TAIL)
PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL)
PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
__SETPAGEFLAG(Referenced, referenced, PF_HEAD)
......
......@@ -141,8 +141,9 @@ extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
void vmalloc_sync_all(void);
void vmalloc_sync_mappings(void);
void vmalloc_sync_unmappings(void);
/*
* Lowlevel-APIs (not for driver use!)
*/
......
......@@ -519,7 +519,7 @@ NOKPROBE_SYMBOL(notify_die);
int register_die_notifier(struct notifier_block *nb)
{
vmalloc_sync_all();
vmalloc_sync_mappings();
return atomic_notifier_chain_register(&die_chain, nb);
}
EXPORT_SYMBOL_GPL(register_die_notifier);
......
......@@ -335,12 +335,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
}
page = pmd_page(orig_pmd);
/* Do not interfere with other mappings of this page */
if (page_mapcount(page) != 1)
goto huge_unlock;
if (next - addr != HPAGE_PMD_SIZE) {
int err;
if (page_mapcount(page) != 1)
goto huge_unlock;
get_page(page);
spin_unlock(ptl);
lock_page(page);
......@@ -426,6 +428,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
continue;
}
/* Do not interfere with other mappings of this page */
if (page_mapcount(page) != 1)
continue;
VM_BUG_ON_PAGE(PageTransCompound(page), page);
if (pte_young(ptent)) {
......
......@@ -2297,28 +2297,41 @@ static void high_work_func(struct work_struct *work)
#define MEMCG_DELAY_SCALING_SHIFT 14
/*
* Scheduled by try_charge() to be executed from the userland return path
* and reclaims memory over the high limit.
* Get the number of jiffies that we should penalise a mischievous cgroup which
* is exceeding its memory.high by checking both it and its ancestors.
*/
void mem_cgroup_handle_over_high(void)
static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
unsigned int nr_pages)
{
unsigned long usage, high, clamped_high;
unsigned long pflags;
unsigned long penalty_jiffies, overage;
unsigned int nr_pages = current->memcg_nr_pages_over_high;
struct mem_cgroup *memcg;
unsigned long penalty_jiffies;
u64 max_overage = 0;
if (likely(!nr_pages))
return;
do {
unsigned long usage, high;
u64 overage;
memcg = get_mem_cgroup_from_mm(current->mm);
reclaim_high(memcg, nr_pages, GFP_KERNEL);
current->memcg_nr_pages_over_high = 0;
usage = page_counter_read(&memcg->memory);
high = READ_ONCE(memcg->high);
/*
* Prevent division by 0 in overage calculation by acting as if
* it was a threshold of 1 page
*/
high = max(high, 1UL);
overage = usage - high;
overage <<= MEMCG_DELAY_PRECISION_SHIFT;
overage = div64_u64(overage, high);
if (overage > max_overage)
max_overage = overage;
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
if (!max_overage)
return 0;
/*
* memory.high is breached and reclaim is unable to keep up. Throttle
* allocators proactively to slow down excessive growth.
*
* We use overage compared to memory.high to calculate the number of
* jiffies to sleep (penalty_jiffies). Ideally this value should be
* fairly lenient on small overages, and increasingly harsh when the
......@@ -2326,24 +2339,9 @@ void mem_cgroup_handle_over_high(void)
* its crazy behaviour, so we exponentially increase the delay based on
* overage amount.
*/
usage = page_counter_read(&memcg->memory);
high = READ_ONCE(memcg->high);
if (usage <= high)
goto out;
/*
* Prevent division by 0 in overage calculation by acting as if it was a
* threshold of 1 page
*/
clamped_high = max(high, 1UL);
overage = div_u64((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT,
clamped_high);
penalty_jiffies = ((u64)overage * overage * HZ)
>> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
penalty_jiffies = max_overage * max_overage * HZ;
penalty_jiffies >>= MEMCG_DELAY_PRECISION_SHIFT;
penalty_jiffies >>= MEMCG_DELAY_SCALING_SHIFT;
/*
* Factor in the task's own contribution to the overage, such that four
......@@ -2360,7 +2358,32 @@ void mem_cgroup_handle_over_high(void)
* application moving forwards and also permit diagnostics, albeit
* extremely slowly.
*/
penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
}
/*
* Scheduled by try_charge() to be executed from the userland return path
* and reclaims memory over the high limit.
*/
void mem_cgroup_handle_over_high(void)
{
unsigned long penalty_jiffies;
unsigned long pflags;
unsigned int nr_pages = current->memcg_nr_pages_over_high;
struct mem_cgroup *memcg;
if (likely(!nr_pages))
return;
memcg = get_mem_cgroup_from_mm(current->mm);
reclaim_high(memcg, nr_pages, GFP_KERNEL);
current->memcg_nr_pages_over_high = 0;
/*
* memory.high is breached and reclaim is unable to keep up. Throttle
* allocators proactively to slow down excessive growth.
*/
penalty_jiffies = calculate_high_delay(memcg, nr_pages);
/*
* Don't sleep if the amount of jiffies this memcg owes us is so low
......@@ -4027,7 +4050,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
struct mem_cgroup_thresholds *thresholds;
struct mem_cgroup_threshold_ary *new;
unsigned long usage;
int i, j, size;
int i, j, size, entries;
mutex_lock(&memcg->thresholds_lock);
......@@ -4047,14 +4070,20 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
__mem_cgroup_threshold(memcg, type == _MEMSWAP);
/* Calculate new number of threshold */
size = 0;
size = entries = 0;
for (i = 0; i < thresholds->primary->size; i++) {
if (thresholds->primary->entries[i].eventfd != eventfd)
size++;
else
entries++;
}
new = thresholds->spare;
/* If no items related to eventfd have been cleared, nothing to do */
if (!entries)
goto unlock;
/* Set thresholds array to NULL if we don't have thresholds */
if (!size) {
kfree(new);
......
......@@ -307,7 +307,8 @@ static void mn_hlist_release(struct mmu_notifier_subscriptions *subscriptions,
* ->release returns.
*/
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist)
hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist,
srcu_read_lock_held(&srcu))
/*
* If ->release runs before mmu_notifier_unregister it must be
* handled, as it's the only way for the driver to flush all
......@@ -370,7 +371,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist) {
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
if (subscription->ops->clear_flush_young)
young |= subscription->ops->clear_flush_young(
subscription, mm, start, end);
......@@ -389,7 +391,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist) {
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
if (subscription->ops->clear_young)
young |= subscription->ops->clear_young(subscription,
mm, start, end);
......@@ -407,7 +410,8 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist) {
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
if (subscription->ops->test_young) {
young = subscription->ops->test_young(subscription, mm,
address);
......@@ -428,7 +432,8 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist) {
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
if (subscription->ops->change_pte)
subscription->ops->change_pte(subscription, mm, address,
pte);
......@@ -476,7 +481,8 @@ static int mn_hlist_invalidate_range_start(
int id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist) {
hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
const struct mmu_notifier_ops *ops = subscription->ops;
if (ops->invalidate_range_start) {
......@@ -528,7 +534,8 @@ mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
int id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist) {
hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
/*
* Call invalidate_range here too to avoid the need for the
* subsystem of having to register an invalidate_range_end
......@@ -582,7 +589,8 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist) {
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
if (subscription->ops->invalidate_range)
subscription->ops->invalidate_range(subscription, mm,
start, end);
......@@ -714,7 +722,8 @@ find_get_mmu_notifier(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
spin_lock(&mm->notifier_subscriptions->lock);
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist) {
&mm->notifier_subscriptions->list, hlist,
lockdep_is_held(&mm->notifier_subscriptions->lock)) {
if (subscription->ops != ops)
continue;
......
......@@ -370,10 +370,14 @@ void vm_unmap_aliases(void)
EXPORT_SYMBOL_GPL(vm_unmap_aliases);
/*
* Implement a stub for vmalloc_sync_all() if the architecture chose not to
* have one.
* Implement a stub for vmalloc_sync_[un]mapping() if the architecture
* chose not to have one.
*/
void __weak vmalloc_sync_all(void)
void __weak vmalloc_sync_mappings(void)
{
}
void __weak vmalloc_sync_unmappings(void)
{
}
......
......@@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
else if (!node_present_pages(node))
searchnode = node_to_mem_node(node);
object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
......@@ -2563,17 +2561,27 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
struct page *page;
page = c->page;
if (!page)
if (!page) {
/*
* if the node is not online or has no normal memory, just
* ignore the node constraint
*/
if (unlikely(node != NUMA_NO_NODE &&
!node_state(node, N_NORMAL_MEMORY)))
node = NUMA_NO_NODE;
goto new_slab;
}
redo:
if (unlikely(!node_match(page, node))) {
int searchnode = node;
if (node != NUMA_NO_NODE && !node_present_pages(node))
searchnode = node_to_mem_node(node);
if (unlikely(!node_match(page, searchnode))) {
/*
* same as above but node_match() being false already
* implies node != NUMA_NO_NODE
*/
if (!node_state(node, N_NORMAL_MEMORY)) {
node = NUMA_NO_NODE;
goto redo;
} else {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
goto new_slab;
......
......@@ -734,6 +734,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
struct mem_section *ms = __pfn_to_section(pfn);
bool section_is_early = early_section(ms);
struct page *memmap = NULL;
bool empty;
unsigned long *subsection_map = ms->usage
? &ms->usage->subsection_map[0] : NULL;
......@@ -764,7 +765,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
* For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified
*/
bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION);
if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
empty = bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION);
if (empty) {
unsigned long section_nr = pfn_to_section_nr(pfn);
/*
......@@ -779,13 +781,15 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
ms->usage = NULL;
}
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
ms->section_mem_map = (unsigned long)NULL;
}
if (section_is_early && memmap)
free_map_bootmem(memmap);
else
depopulate_section_memmap(pfn, nr_pages, altmap);
if (empty)
ms->section_mem_map = (unsigned long)NULL;
}
static struct page * __meminit section_activate(int nid, unsigned long pfn,
......
......@@ -1295,7 +1295,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
* First make sure the mappings are removed from all page-tables
* before they are freed.
*/
vmalloc_sync_all();
vmalloc_sync_unmappings();
/*
* TODO: to calculate a flush range without looping.
......@@ -3128,16 +3128,19 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
EXPORT_SYMBOL(remap_vmalloc_range);
/*
* Implement a stub for vmalloc_sync_all() if the architecture chose not to
* have one.
* Implement stubs for vmalloc_sync_[un]mappings () if the architecture chose
* not to have one.
*
* The purpose of this function is to make sure the vmalloc area
* mappings are identical in all page-tables in the system.
*/
void __weak vmalloc_sync_all(void)
void __weak vmalloc_sync_mappings(void)
{
}
void __weak vmalloc_sync_unmappings(void)
{
}
static int f(pte_t *pte, unsigned long addr, void *data)
{
......